Method and system for processing audio communication on network

Document No.: 1570572    Publication date: 2020-01-24

Reading note: This technology, "Method and system for processing audio communication on a network" (处理网络上的音频通信的方法和系统), was designed and created by 熊飞, 史景慧, 陈磊, 任旻, and 彭飞翔 on 2017-11-03. Abstract: A method of processing audio communications over a network, the method comprising: at a first client device: receiving a first audio transmission from a second client device, the first audio transmission provided in a source language different from a default language associated with the first client device; obtaining a current user language attribute of the first client device, the current user language attribute indicating a current language for a communication session at the first client device; if the current user language attribute indicates that the target language currently used for the communication session at the first client device is different from the default language associated with the first client device: obtaining a translation of the first audio transmission from the source language to the target language; and presenting the translation of the first audio transmission in the target language to a user at the first client device.

1. A method of processing audio communications over a network, performed at a first client device that establishes an audio and/or video communication session with a second client device over the network, the method comprising, during the audio and/or video communication session:

receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device;

determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device;

obtaining a translation of the first audio transmission from the source language to the target language if the target language is different from the default language;

presenting the translation to a user at the first client device.

2. The method of claim 1, wherein obtaining the one or more current user language attributes of the first client device comprises:

obtaining facial features of the user at the first client device and obtaining geographic location information of the first client device;

and wherein determining the target language according to the one or more current user language attributes comprises:

determining the target language from a combination of the facial features and the geographic location information.

3. The method of claim 1, wherein obtaining the one or more current user language attributes of the first client device comprises:

obtaining audio input received locally at the first client device during the audio and/or video communication session;

and wherein determining the target language according to the one or more current user language attributes comprises:

linguistically analyzing the audio input received locally at the first client device to recommend the target language as the current language used at the first client device.

4. The method of claim 1, further comprising:

obtaining sound characteristics of speech in the first audio transmission;

generating a simulated first audio transmission based on the sound characteristics, the simulated first audio transmission including the translation spoken in the target language based on the sound characteristics.

5. The method of claim 4, wherein presenting the translation to a user at the first client device comprises:

presenting a text representation of the translation to the user at the first client device; and

presenting the simulated first audio transmission to the user at the first client device.

6. The method of claim 1, wherein during the audio and/or video communication session, the method further comprises:

detecting a continuous speech input by a user at the first client device;

tagging a start time of the continuous speech input as a beginning of a first audio segment detected at the first client device;

detecting a first predefined interruption in the continuous speech input at the first client device;

in response to detecting the first predefined interruption in the continuous speech input, marking a start time of the first predefined interruption as an end of the first audio segment detected at the first client device, wherein the first audio segment is included in a second audio transmission sent to the second client device.

7. The method of claim 6, further comprising:

generating a first audio packet after detecting the first predefined interruption in the continuous speech input, the first audio packet comprising the first audio segment;

sending the first audio packet to the second client device as a first portion of the second audio transmission;

while generating the first audio packet and transmitting the first audio packet:

continuing to detect the continuous speech input of the user at the first client device, wherein at least a portion of the continuous speech input detected while generating and transmitting the first audio packet is included in the second audio transmission as a second portion of the second audio transmission.

8. The method of claim 7, further comprising:

translating a plurality of audio segments including the first audio segment and a second audio segment into the source language for presentation at the second client device.

9. The method of claim 6, wherein during the audio and/or video communication session, the method further comprises:

identifying a plurality of audio segments in the continuous speech input;

generating a respective audio packet for each of the plurality of audio segments; and

sequentially sending the respective audio packets of the plurality of audio segments to the second client device according to the respective start timestamps of the audio segments.

10. The method according to claim 8 or 9, wherein during the audio and/or video communication session, the method further comprises:

continuously capturing video using a camera at the first client device while the continuous speech input is captured at the first client device;

tagging the continuously captured video according to respective start timestamps of the plurality of audio segments, wherein the respective start timestamps are used to synchronize presentation of the video and respective translations of the plurality of audio segments at the second client device.

11. An electronic device acting as a first client device that has established an audio and/or video communication session with a second client device over a network, the electronic device comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for, during the audio and/or video communication session:

receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device;

determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device;

if the target language is different from the default language associated with the first client device,

obtaining a translation of the first audio transmission from the source language to the target language;

presenting the translation to a user at the first client device.

12. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by an electronic device acting as a first client device that has established an audio and/or video communication session with a second client device over a network, cause the device to perform operations comprising, during the audio and/or video communication session:

receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device;

determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device;

obtaining a translation of the first audio transmission from the source language to the target language if the target language is different from the default language associated with the first client device;

presenting the translation to a user at the first client device.

13. An electronic device acting as a first client device having established an audio and/or video communication session with a second client device over a network, the electronic device comprising, during the audio and/or video communication session:

means for receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

means for obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device;

means for determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device; and means for, if the target language is different from the default language associated with the first client device:

obtaining a translation of the first audio transmission from the source language to the target language;

presenting the translation to a user at the first client device.

14. An information processing apparatus for use in an electronic device, acting as a first client device that has established an audio and/or video communication session with a second client device over a network, the information processing apparatus comprising, during the audio and/or video communication session:

means for receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

means for obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device;

means for determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device; and means for, if the target language is different from the default language associated with the first client device:

obtaining a translation of the first audio transmission from the source language to the target language;

presenting the translation to a user at the first client device.

15. An electronic device, comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the method of any of claims 1-10.

16. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by an electronic device with one or more processors, cause the device to perform the method of any of claims 1-10.

17. An electronic device, comprising:

means for performing the method of any one of claims 1 to 10.

18. An information processing apparatus for use in an electronic device with a display and a touch-sensitive surface, comprising:

means for performing the method of any one of claims 1 to 10.

19. A method of processing audio communications over a network, performed at a server through which a first client device establishes an audio and/or video communication session with a second client device over the network, the method comprising, during the audio and/or video communication session:

receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device;

determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device;

obtaining a translation of the first audio transmission from the source language to the target language if the target language is different from the default language associated with the first client device;

sending the translation to the first client device, wherein the translation is presented to a user at the first client device.

20. The method of claim 19, wherein obtaining the one or more current user language attributes comprises:

receiving, from the first client device, facial features of the current user and a current geographic location of the first client device;

and wherein determining the target language according to the one or more current user language attributes comprises:

determining a relationship between the facial features and the current geographic location;

and determining the target language when the relationship meets a predefined criterion.

21. The method of claim 19, wherein obtaining the one or more current user language attributes comprises:

receiving, from the first client device, an audio message that has been received locally at the first client device;

and wherein determining the target language according to the one or more current user language attributes comprises:

analyzing language characteristics of the audio message to obtain an analysis result; and

determining the target language according to the analysis result.

22. The method of claim 19, further comprising:

obtaining sound characteristics of speech in the first audio transmission;

generating a simulated first audio transmission based on the sound characteristics, the simulated first audio transmission including the translation spoken in the target language based on the sound characteristics.

23. The method of claim 22, wherein sending the translation to the first client device comprises:

sending a text representation of the translation to the first client device; and

sending the simulated first audio transmission to the first client device.

24. The method of claim 19, wherein receiving the first audio transmission from the second client device comprises:

receiving, from the second client device, a plurality of audio packets of the first audio transmission, wherein the plurality of audio packets have been sequentially transmitted from the second client device according to respective timestamps of the plurality of audio packets, wherein each respective timestamp indicates a start time of a corresponding audio segment identified in the first audio transmission.

25. The method of claim 24, wherein obtaining the translation of the first audio transmission from the source language to the target language comprises:

sequentially obtaining respective translations of the plurality of audio packets from the source language to the target language according to the respective timestamps of the plurality of audio packets;

and wherein sending the translation to the first client device comprises:

after completing a first translation of at least one of the plurality of audio packets and before completing a translation of at least another one of the plurality of audio packets, sending the first translation to the first client device.

26. The method of claim 24, further comprising:

receiving a first video transmission concurrently with receiving the first audio transmission from the second client device, wherein the first video transmission is tagged with the same set of timestamps as the plurality of audio packets;

sending the first video transmission and the respective translation of the plurality of audio packets in the first audio transmission having the same set of timestamps to the first client device such that the first client device synchronously renders the respective translation of the plurality of audio packets of the first audio transmission and the first video transmission according to the same set of timestamps.

27. An electronic device acting as a server through which a first client device establishes an audio and/or video communication session with a second client device over a network, comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for, during the audio and/or video communication session:

receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device;

determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device;

obtaining a translation of the first audio transmission from the source language to the target language if the target language is different from the default language associated with the first client device;

sending the translation to the first client device, wherein the translation is presented to a user at the first client device.

28. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by an electronic device acting as a server through which a first client device has established an audio and/or video communication session with a second client device over a network, cause the device to perform operations comprising, during the audio and/or video communication session:

receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device;

determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device;

obtaining a translation of the first audio transmission from the source language to the target language if the target language is different from the default language associated with the first client device;

sending the translation to the first client device, wherein the translation is presented to a user at the first client device.

29. An electronic device acting as a server through which a first client device establishes an audio and/or video communication session with a second client device over a network, the electronic device comprising, during the audio and/or video communication session:

means for receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

means for obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device;

means for determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device; and means for, if the target language is different from the default language associated with the first client device:

obtaining a translation of the first audio transmission from the source language to the target language;

sending the translation to the first client device, wherein the translation is presented to a user at the first client device.

30. An information processing apparatus for use in an electronic device that acts as a server through which a first client device establishes an audio and/or video communication session with a second client device over a network, the information processing apparatus comprising, during the audio and/or video communication session:

means for receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

means for obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device;

means for determining a target language from the one or more current user language attributes, the target language being a language currently recommended for the audio and/or video communication session at the first client device; and means for, if the target language is different from the default language associated with the first client device:

obtaining a translation of the first audio transmission from the source language to the target language;

sending the translation to the first client device, wherein the translation is presented to a user at the first client device.

31. An electronic device, comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the method of any of claims 19-26.

32. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by an electronic device with one or more processors, cause the device to perform the method of any of claims 19-26.

33. An electronic device, comprising:

means for performing the method of any one of claims 19 to 26.

34. An information processing apparatus for use in an electronic device with a display and a touch-sensitive surface, comprising:

means for performing the method of any one of claims 19 to 26.

Technical Field

The present disclosure relates to the field of internet technologies, and in particular, to a method and system for processing audio communications on a network.

Background

The development of internet technology and of real-time audio and video communication over networks has made communication between people very convenient. When people communicate using the same language, voice audio captured at both ends of the communication channel can be transmitted in a duplex manner and output at the receiving end without significant delay. However, people in a communication session sometimes use different languages, and real-time translation is then needed to help them communicate efficiently. The languages used are not necessarily pre-specified, and setting translation preferences on the fly can be time consuming and cumbersome, which can negatively impact user experience.

Disclosure of Invention

As discussed in the background, manually setting translation preferences prior to establishing an audio/video communication session may be adequate for some scenarios, but it does not address unforeseen translation requirements or ad hoc communication with others whose language preferences are unknown. For example, in a group conference scenario or a customer support scenario, speakers in the group conference may prefer to use different languages, and these languages may not be known until a communication session is established or until each participant speaks. Furthermore, when different people speak using the same client device at different times during a communication session, the language spoken in the communication may change in a relatively unpredictable manner. Attempting to manually adjust translation preferences (e.g., manually specifying source and target languages) after a communication session has begun and after a user begins speaking can result in unnecessary delays and communication interruptions between users. The solution disclosed herein may address the above deficiencies of the current art.

As disclosed herein, in some embodiments, a method of processing audio communications over a network comprises, at a first client device that has established an audio and/or video communication session with a second client device over the network, during the audio and/or video communication session: receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device; obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device; determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device; obtaining a translation of the first audio transmission from the source language to the target language if the target language is different from the default language associated with the first client device; and presenting the translation to a user at the first client device.

In some embodiments, a method of processing audio communications over a network comprises, at a server through which a first client device has established an audio and/or video communication session with a second client device over the network, during the audio and/or video communication session: receiving a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device; obtaining one or more current user language attributes of the first client device, wherein the one or more current user language attributes are indicative of a current language of the audio and/or video communication session at the first client device; determining a target language from the one or more current user language attributes, the target language being a language currently recommended for use at the first client device; obtaining a translation of the first audio transmission from the source language to the target language if the target language is different from the default language associated with the first client device; and sending the translation to the first client device, wherein the translation is presented to a user at the first client device.

In some embodiments, a first client device for handling audio communications over a network establishes an audio and/or video communication session with a second client device over the network; during the audio and/or video communication session, the first client device comprises a receiving unit, an obtaining unit, and a presenting unit:

the receiving unit is configured to receive a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

the obtaining unit is configured to obtain one or more current user language attributes of the first client device, where the one or more current user language attributes are used to indicate a current language of the audio and/or video communication session at the first client device;

the obtaining unit is further configured to determine a target language based on the one or more current user language attributes, the target language being a language currently recommended for the audio and/or video communication session at the first client device, and to obtain a translation of the first audio transmission from the source language to the target language if the target language is different from the default language associated with the first client device; and

the presenting unit is configured to present the translation to a user at the first client device.

In some embodiments, a server for handling audio communications over a network, through which a first client device establishes an audio and/or video communication session with a second client device over the network, comprises a receiving unit, an obtaining unit, and a sending unit that operate during the audio and/or video communication session:

the receiving unit is configured to receive a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device;

the obtaining unit is configured to obtain one or more current user language attributes of the first client device, where the one or more current user language attributes are used to indicate a current language of the audio and/or video communication session at the first client device;

the obtaining unit is further configured to determine a target language based on the one or more current user language attributes, the target language being a language currently recommended for use at the first client device, and to obtain a translation of the first audio transmission from the source language to the target language if the target language is different from the default language associated with the first client device; and

the sending unit is configured to send the translation to the first client device, wherein the translation is presented to a user at the first client device.

In accordance with some embodiments, an electronic device includes a display, an optional touch-sensitive surface, optionally one or more sensors to detect intensity of contacts with the touch-sensitive surface, optionally one or more tactile output generators, one or more processors, and memory storing one or more programs; the one or more programs are configured to be executed by the one or more processors and the one or more programs include instructions for performing, or causing the performance of, the operations of any of the methods described herein. In accordance with some embodiments, a computer-readable storage medium has instructions stored therein, which when executed by an electronic device with a display, an optional touch-sensitive surface, optionally one or more sensors to detect intensity of contacts with the touch-sensitive surface, and optionally one or more tactile output generators, cause the device to perform or cause to perform operations of any of the methods described herein. According to some embodiments, a graphical user interface on an electronic device with a display, an optional touch-sensitive surface, an optional one or more sensors to detect intensity of contacts with the touch-sensitive surface, an optional one or more tactile output generators, a memory, and one or more processors to execute one or more programs stored in the memory includes one or more of the elements presented in any of the methods described herein that are updated in response to an input, as described in any of the methods described herein. According to some embodiments, an electronic device comprises: a display, an optional touch-sensitive surface, optionally one or more sensors to detect intensity of contacts with the touch-sensitive surface, and optionally one or more tactile output generators; and means for performing or causing performance of the operations of any of the methods described herein. In accordance with some embodiments, an information processing apparatus for use in an electronic device with a display, an optional touch-sensitive surface, optionally one or more sensors to detect intensity of contacts with the touch-sensitive surface, and optionally one or more tactile output generators, comprises means for performing, or causing to be performed, operations of any of the methods described herein.

In some embodiments, a computing device (e.g., server system 108, 204 of fig. 1, 2; client device 104, 200, 202 of fig. 1 and 2; or a combination of these server systems and client devices) includes one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for performing or controlling the operations of performing any of the methods described herein. In some embodiments, a non-transitory computer readable storage medium stores one or more programs, the one or more programs comprising instructions, which when executed by a computing device (e.g., the server system 108, 204 of fig. 1, 2; the client device 104, 200, 202 of fig. 1 and 2; or a combination of these server systems and client devices) having one or more processors, cause the computing device to perform or control the operations of performing any of the methods described herein. In some embodiments, a computing device (e.g., server system 108, 204 of fig. 1, 2; client device 104, 200, 202 of fig. 1 and 2; or a combination of these server system and client device) includes means for performing or controlling the operations of performing any of the methods described herein.

Various advantages of the present application will be apparent from the following description.

Drawings

The foregoing features and advantages of the disclosed technology, as well as additional features and advantages thereof, will be more clearly understood from the following detailed description of preferred embodiments taken in conjunction with the accompanying drawings.

In order to more clearly describe embodiments of the disclosed technology or technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is evident that the drawings in the following description only show some embodiments of the disclosed technology and that still further drawings can be derived from these drawings by a person skilled in the art without inventive effort.

FIG. 1 is a block diagram of a server-client environment according to some embodiments.

Fig. 2A-2B are block diagrams illustrating an audio and/or video communication session between a first client device and a second client device established over a network via a server, according to some embodiments.

Fig. 3-5 are communication timing diagrams of interactions between a first client device, a second client device, and a server during an audio and/or video communication session over a network, according to some embodiments.

Fig. 6A-6G illustrate flow diagrams of methods of processing audio communications, according to some embodiments.

Fig. 7A-7F illustrate flow diagrams of methods of processing audio communications, according to some embodiments.

Fig. 8 is a block diagram of a client device according to some embodiments.

Fig. 9 is a block diagram of a server system according to some embodiments.

Like reference numerals designate corresponding parts throughout the several views of the drawings.

Detailed Description

Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the subject matter presented herein. It will be apparent, however, to one skilled in the art that the subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the embodiments.

The technical solutions in the embodiments of the present application are clearly and thoroughly described below with reference to the accompanying drawings in the embodiments of the present application. It is clear that the described embodiments are only a part of the embodiments of the present application, not all embodiments. All other embodiments that can be derived based on the embodiments of the present application by a person of ordinary skill in the art without inventive effort are intended to be within the scope of protection of the present application.

As shown in fig. 1, data processing for a real-time audio/video communication platform is implemented in a server-client environment 100, according to some embodiments. According to some embodiments, the server-client environment 100 includes client-side processes 102-1, 102-2, 102-3 (hereinafter "client-side module 102") executing on client devices 104-1, 104-2, 104-3 and server-side processes 106 (hereinafter "server-side module 106") executing on a server system 108. The client-side module 102 communicates with the server-side module 106 over one or more networks 110. The client-side module 102 provides client-side functionality of the social networking platform and communicates with the server-side module 106. The server-side module 106 provides server-side functionality of the social networking platform for any number of client modules 102 that each reside on a respective client device 104.

In some embodiments, the server-side module 106 includes one or more processors 112 (e.g., processor 902 in FIG. 9), a session database 114, a user database 116, I/O interfaces 118 to one or more clients, and I/O interfaces 120 to one or more external services. The I/O interface 118 to one or more clients facilitates client-oriented input and output processing by the server-side module 106. Session database 114 stores preset preferences for communication sessions (e.g., virtual conference rooms) that users have established, and user database 116 stores user profiles for users in the communication platform. The I/O interface 120 to one or more external services facilitates communication with one or more external services 122 (e.g., web servers or cloud-based service providers, such as file sharing and data storage services).
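For illustration only, the following is a minimal sketch of the kinds of records the session database 114 and user database 116 might hold; the Python data classes and field names are assumptions for illustration, not structures taken from the figures.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class UserProfile:
    user_id: str
    default_language: Optional[str] = None  # e.g. "zh-CN"; may be unset before a session
    preferred_voice: str = "generic"         # used when simulating translated speech

@dataclass
class SessionRecord:
    session_id: str
    participant_ids: list = field(default_factory=list)
    # per-device language state resolved during the session
    current_language: dict = field(default_factory=dict)

# Example: a two-party session with per-device language state.
session = SessionRecord("room-42", ["client-A", "client-B"],
                        {"client-A": "zh-CN", "client-B": "en-US"})
print(session.current_language["client-A"])  # zh-CN
```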

Examples of client devices 104 include, but are not limited to, handheld computers, wearable computing devices, Personal Digital Assistants (PDAs), tablet computers, laptop computers, desktop computers, cellular telephones, smart phones, Enhanced General Packet Radio Service (EGPRS) mobile phones, media players, navigation devices, game consoles, televisions, remote controls, point of sale (POS) terminals, onboard computers, e-book readers, or a combination of any two or more of these or other data processing devices.

Examples of one or more networks 110 include a Local Area Network (LAN) and a Wide Area Network (WAN), such as the internet. Optionally, one or more networks 110 are implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi, Voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.

The server system 108 is implemented on one or more stand-alone data processing devices or on a distributed computer network. In some embodiments, the server system 108 also uses various virtual devices and/or services of third-party service providers (e.g., third-party cloud service providers) to provide the underlying computing resources and/or infrastructure resources of the server system 108. In some embodiments, the server system 108 includes, but is not limited to, a handheld computer, a tablet computer, a laptop computer, a desktop computer, or a combination of any two or more of these or other data processing devices.

The server system 108 also implements various modules for supporting real-time audio/video communications, such as communications in an online virtual meeting room by multiple users located at different locations, including an audio/video services module 124, a speech-to-text module 126, and a translation services module 128, among others.

The server-client environment 100 shown in fig. 1 includes a client-side portion (e.g., client-side module 102) and a server-side portion (e.g., server-side module 106). In some embodiments, the data processing is implemented as a standalone application installed on the client device 104. Additionally, the division of functionality between the client portion and the server portion of the client-server environment 100 may vary in different embodiments. For example, in some embodiments, the client-side module 102 is a thin client that provides only user-oriented input and output processing functions and delegates all other data processing functions to a backend server (e.g., server system 108). While many aspects of the present technology are described from the server perspective, those skilled in the art will appreciate, without inventive effort, the corresponding actions performed by a client device. Further, some aspects of the present technology may be performed by a server, by a client device, or by a server and a client device in cooperation.

Attention is now directed to embodiments of a user interface and associated processes that may be implemented on the client device 104.

Fig. 2A-2B are block diagrams illustrating an audio and/or video communication session between a first client device (e.g., client device a) and a second client device (e.g., client device B) established over a network via a server, according to some embodiments.

As shown in fig. 2A, in some embodiments, a user A of a client device A 200 (e.g., client device 104-1 in fig. 1; client device A 200 may be a smartphone or a computer) needs to converse with a user B of a client device B 202 (e.g., client device 104-2) via an audio and/or video communication session. Client device A sends an audio/video communication request (e.g., via client-side module 102-1) to server 204 (e.g., server system 108). In response to receiving the request, the server transmits the request to client device B. When client device B receives the request, a call alert is optionally displayed on client device B (e.g., in a user interface of client-side module 102-2). If user B accepts the call request (e.g., when the "accept" button 206 is selected in the user interface shown on client device B), an indication may be sent to user A that user B has accepted the audio/video communication request sent by client device A. For example, client device B sends an acceptance instruction to the server. When the server receives the acceptance instruction, the server establishes an audio/video communication session that supports audio/video transmission between client device A and client device B. In some embodiments, the server provides services (e.g., audio/video transmission services, speech-to-text services, translation services, file sharing services, etc.). In some embodiments, if user B rejects the audio/video communication request (e.g., selects the "reject" button 207), the server terminates the request and sends a response to client device A indicating that the call request was rejected.
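As a minimal sketch of this call-setup exchange, assuming a simplified Python signaling server and message names that are illustrative rather than prescribed by the figures:

```python
from dataclasses import dataclass

@dataclass
class CallRequest:
    from_device: str
    to_device: str

class SignalingServer:
    """Toy stand-in for server 204 relaying call requests and answers."""
    def __init__(self):
        self.sessions = []

    def forward_request(self, req: CallRequest) -> str:
        # Relay the audio/video communication request; the callee UI would then
        # show the "accept" (206) / "reject" (207) alert.
        return f"alert shown on {req.to_device}"

    def handle_answer(self, req: CallRequest, accepted: bool) -> str:
        if accepted:
            # Establish a session supporting duplex audio/video transmission.
            self.sessions.append((req.from_device, req.to_device))
            return "session established"
        return "request rejected"

server = SignalingServer()
req = CallRequest("client-A", "client-B")
print(server.forward_request(req))
print(server.handle_answer(req, accepted=True))
```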

Fig. 2B illustrates an exemplary audio and/or video communication session between client device A and client device B after the communication session is established (e.g., in the manner shown in fig. 2A). Although the session is established by one of the participants (e.g., user A), the example exchange shown in fig. 2B may occur in either direction, with the roles of the two devices reversed.

In one example, user B first communicates with user A via client device B. Client device B receives continuous speech input from user B spoken in a first source language (e.g., user B speaks "How are you!" in English) and/or captures an image of user B in front of client device B (e.g., captures an image of user B's face via camera 213 on client device B). Client device B transmits the captured audio and/or video to the server as one or more transmissions (e.g., packets, messages, etc.) or data streams (e.g., shown as first audio/video transmission 215) destined for client device A. The first audio/video transmission includes the continuous voice input received from user B and the stream of captured images. In some embodiments, the communication session between client devices A and B is an audio-only communication session without video.

In some embodiments illustrated in fig. 2B, when the server receives the first audio/video transmission sent by the audio/video service module 217 of client device B, the server sends the first video transmission to client device A and sends the first audio transmission to the speech recognition service module 219 (e.g., a speech recognition service module provided by the server or a third party service provider). The speech recognition service module 219 performs speech-to-text processing on the first audio transmission to generate a text string in the source language and sends the text string in the source language to the translation service module 221 (e.g., a translation module provided by the server or a third party service provider). The translation service module 221 translates the text string generated from the first audio transmission from the source language (e.g., the source language type received from client device B or otherwise determined by the server) into a text string in the target language (e.g., the target language type received from client device A or otherwise determined by the server), and sends the translation of the text string generated from the first audio transmission, and optionally the original first audio transmission and the text string in the source language, to client device A. In some embodiments, the translation is in textual form. In some embodiments, the translation is converted to a speech form. In some embodiments, the text form and the speech form are sent together to client device A. In some embodiments, the original audio of the first audio transmission is removed and replaced with an audio translation. In some embodiments, the text translation is added as closed captioning to the original audio transmission. In some embodiments, the text string in the source language is added as closed captioning. When client device A receives the translation of the first audio transmission and the text string in the source language, client device A presents the translation, and optionally the text string and audio in the source language, to user A (e.g., the translation is displayed on display 208 of client device A, shown as the translation "hello" in Chinese 223 and the source-language text "How are you" in English 225 displayed on display screen 208).
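A minimal sketch of this server-side pipeline, assuming Python and placeholder functions standing in for the speech recognition service module 219, the translation service module 221, and an optional text-to-speech step:

```python
from dataclasses import dataclass
from typing import Optional

def recognize_speech(audio: bytes, language: str) -> str:
    """Placeholder for the speech recognition service module 219."""
    return "How are you!"

def translate_text(text: str, source: str, target: str) -> str:
    """Placeholder for the translation service module 221."""
    return "你好" if target == "zh-CN" else text

def synthesize_speech(text: str, language: str) -> Optional[bytes]:
    """Optional placeholder TTS for delivering the translation in speech form."""
    return None

@dataclass
class TranslatedTransmission:
    original_audio: bytes
    source_text: str               # may be shown as closed captioning (225)
    translated_text: str           # e.g. shown as 223 on display 208
    translated_audio: Optional[bytes]

def process_audio_leg(audio: bytes, source_lang: str, target_lang: str) -> TranslatedTransmission:
    source_text = recognize_speech(audio, source_lang)
    translated = translate_text(source_text, source_lang, target_lang)
    return TranslatedTransmission(audio, source_text, translated,
                                  synthesize_speech(translated, target_lang))

result = process_audio_leg(b"...pcm...", "en-US", "zh-CN")
print(result.source_text, "->", result.translated_text)
```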

In some embodiments, when client device A and client device B have established a real-time video communication session over the internet, there is a small display box (e.g., shown as 227) for presenting user A's image on client device A and a large display box (e.g., shown as 229) for presenting user B's image on client device A. In some embodiments, an indication of the currently identified source language for each device is displayed in association with the display box for that device. For example, display box 227 at client device A has an indication that Chinese is the source language currently used at client device A, and display box 229 at client device A has an indication that English is the source language currently used at client device B.

In some embodiments, a default language is pre-specified at client device A. For example, in some embodiments, the default language of client device A is automatically selected by the server for client device A, or is a preferred language that user A has specified in the language settings of client device A before the video communication session begins. For example, user A is a Chinese user, and user A may pre-designate the default language of his device as Chinese in the language settings of client device A. As another example, the default language is specified by the server when the application is set up. In some embodiments, neither user A nor the server has set a default language by the time the audio/video communication session is established.

In some embodiments, when the translation service module detects that the source language of the first audio transmission from client device B is different from the default language of client device A, the translation service module or the server obtains one or more user language attributes of user A from client device A (e.g., facial features of the user, geographic location information of client device A, audio messages received locally at client device A, etc.) and translates the speech in the source language into a target language determined from the user language attributes of client device A. The target language is sometimes different from the default language pre-specified for the client device prior to establishing the communication session, and the target language is determined in real time based on user language attributes collected after the communication session is established or while it is being established. For example, user B speaks at client device B using a source language such as English, and the default language of client device A has been designated as Japanese by a previous user input in the settings interface. However, in evaluating the user language attributes of client device A, the server determines that the current user A may not speak or understand Japanese. Instead, the server determines that the user at client device A is a Chinese-speaking user and understands Chinese. The translation service module then translates the speech received from client device B using English as the source language and Chinese as the target language.

In some embodiments, client device A obtains one or more current user language attributes of client device A by obtaining facial features of user A and obtaining geographic location information of client device A. The facial features optionally include ethnic features (e.g., eye color, facial structure, hair color, etc.) that indicate the ethnicity or nationality of user A, or facial features that are used to determine whether user A, currently using client device A, is different from the user who set the default language for client device A. The geographic location information of the first client device optionally includes a current location of client device A, historical locations for a preset time period prior to the current time, or a pre-stored location of client device A. For example, in some embodiments, client device A captures facial features of user A speaking and/or listening at client device A and obtains current geographic location information of client device A.

In some embodiments, the facial features of the user at client device A and the geographic location information of client device A are combined to recommend the target language (e.g., Chinese) as the current language used at client device A, rather than the default language (e.g., Japanese) associated with client device A. For example, client device A determines that the current user A is Caucasian based on the facial features and determines that the current location of client device A is in North America. Based on the combination of the ethnicity and the geographic location information, client device A infers that the current language used at client device A is likely English, and therefore uses English as the translation target language for the user currently using client device A. In some embodiments, if the default language of client device A has been specified by a previous user input in the settings interface, client device A requires that at least one of the currently collected facial features and/or geographic location information indicate that the current user is different from the user who specified the default language of client device A. In some embodiments, the translation from the source language to the target language is provided to the user at client device A only after client device A receives confirmation that the recommended target language is a correct recommendation.
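A minimal sketch of this combination logic, assuming Python, a hypothetical lookup table, and illustrative helper names; a deployed system would instead rely on the trained data model discussed below.

```python
from typing import Optional

# Hypothetical mapping from (apparent ethnicity, region) to a recommended language.
RECOMMENDATION_TABLE = {
    ("caucasian", "north-america"): "en-US",
    ("east-asian", "china"): "zh-CN",
    ("east-asian", "north-america"): "en-US",
}

def recommend_target_language(ethnic_hint: Optional[str],
                              region: Optional[str],
                              default_language: str) -> str:
    """Recommend a current language, falling back to the device's default."""
    recommended = RECOMMENDATION_TABLE.get((ethnic_hint, region))
    return recommended or default_language

# Example from the text: Caucasian user, device located in North America,
# default language previously set to Japanese -> English is recommended.
print(recommend_target_language("caucasian", "north-america", "ja-JP"))  # en-US
```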

In some embodiments, the determination of the target language is performed by a server of the communication session after client device A collects the facial features and the geographic location information and sends the collected information to the server. In some embodiments, the target language is determined based on a data model trained on the server and then stored at client device A. In some embodiments, client device A presents a confirmation request to the user at client device A before client device A determines that the target language will replace the default language as the approved current language for use at client device A.

In some embodiments, client device A determines the target language locally without transmitting the facial features and geographic location information to the server, which helps to protect user privacy and reduce server load.

In some embodiments, client device A obtains one or more current user language attributes of client device A by obtaining audio input received locally at client device A during the audio and/or video communication session. The audio input received locally at client device A is linguistically analyzed (e.g., using a language model or a speech model to determine the language spoken) to recommend the target language as the current language used at client device A. For example, rather than mistakenly treating the default language currently associated with client device A as the current language used at client device A, client device A or the server recognizes the language of the audio input as English, determines that the current language used at client device A is English, and recommends English as the target language for client device A.
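A minimal sketch of this recommendation step, assuming Python and a toy language-identification helper in place of a real language or speech model:

```python
def identify_spoken_language(transcript: str) -> str:
    """Toy language guess from a transcript of locally captured audio."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in transcript):
        return "zh-CN"  # transcript contains CJK ideographs
    return "en-US"

def recommend_from_local_audio(transcript: str, default_language: str) -> str:
    detected = identify_spoken_language(transcript)
    # Recommend the detected language rather than assuming the default.
    return detected or default_language

print(recommend_from_local_audio("ok, let's start the meeting", "ja-JP"))  # en-US
```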

Fig. 3 is a communication timing diagram of interactions over a network, via a server, between a client device A and a second client device B that establish an audio and/or video communication session. In some embodiments, client device A first sends an audio/video communication session request to the second client device B through a server (or directly, rather than through the server), illustrated by 302 through 304. When client device B accepts the request (shown as 306), the server establishes a connection (shown as 308) for an audio and/or video communication session between client device A and client device B. The steps from 302 to 308 are also described with reference to fig. 2A.

When client device B sends a first audio/video transmission spoken in user B's source language to the server (shown as 310), the server performs speech-to-text recognition on the first audio transmission in the source language to generate a text representation of the first audio transmission in the source language (shown as 312). Before the server translates the text string in the source language into a text string in the target language, the server determines whether the source language of client device B is different from the default language associated with client device A (e.g., a default language automatically selected by the server for client device A, or a preferred language specified by the user in the language settings of client device A before the video communication session began). If it is determined that the source language is different from the default language of client device A (shown as 314), then the server translates the first audio transmission from the source language to a target language determined from the current user language attributes of client device A (shown as 316) (e.g., as described with reference to Figs. 2A and 2B). In some embodiments, if the source language of client device B is the same as the default language of client device A, the server does not perform any translation.
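A minimal sketch of the decision at 314 is shown below. It assumes the server already holds a text representation of the transmission and a machine-translation callable; the helper names are hypothetical.

```python
# Illustrative sketch of step 314: translate only when the sender's source
# language differs from the language in use at the receiving client.

def maybe_translate(source_text, source_language, receiver_default, receiver_target, translate):
    # receiver_target comes from the current user language attributes; it may be
    # None when no recommendation has been made, in which case the default applies.
    target = receiver_target or receiver_default
    if source_language == target:
        return None                                           # no translation performed
    return translate(source_text, source_language, target)    # caller-supplied MT function
```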

After the server completes the translation of the first audio transmission from the source language to the target language, the server sends the first audio transmission in the source language and the translated text representation of the original audio to client device A (shown as 322). Client device A receives the first audio transmission in the source language and the translated text representation of the original audio and presents the text representations (e.g., as shown at 223 and 225 in Fig. 2B) on the display.

In some embodiments, the server generates a simulated first audio transmission from the sound characteristics of user B, the simulated first audio transmission including a translation of the first audio transmission, and sends the simulated first audio transmission to client device A (shown as 324 and 326). For example, the server obtains the sound characteristics of the speech in the first audio transmission. The sound characteristics optionally include a voiceprint or a set of predefined characteristics, such as the frequency, tone, pitch, duration, amplitude, etc., of a person's voice. In some embodiments, the simulated first audio transmission is generated using a generic voice of a male, a female, or a child, based on the sound characteristics obtained from the first audio transmission indicating whether the original first audio transmission was spoken by a male, a female, or a child. In some embodiments, the simulated first audio transmission closely mimics the speech of the original first audio transmission.

In some embodiments, the server automatically switches between using generic speech or specially simulated speech to speak the translation, depending on the server load (e.g., processing power, memory, and network bandwidth) and the rate at which the audio transmission is received at the client device. For example, when the server load is above a predefined threshold, the simulated first audio transmission is provided in speech generated from a small subset of the sound characteristics of the original first audio transmission (e.g., only dominant frequency and pitch); and when the server load is below the predefined threshold, the simulated first audio transmission is provided in speech generated from a larger subset of the sound characteristics of the original first audio transmission (e.g., a wider range of frequencies, pitches, amplitudes, etc.).
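The load-dependent switch can be sketched as below. The threshold value, the voice-profile field names, and the synthesizer callable are assumptions made for illustration only.

```python
# Illustrative sketch: pick a richer or poorer subset of the speaker's voice
# characteristics depending on current server load, then hand the translation
# text and the chosen features to a caller-supplied text-to-speech function.

HIGH_LOAD_THRESHOLD = 0.8       # fraction of capacity; illustrative value

def synthesize_simulated_audio(translated_text, voice_profile, server_load, tts):
    if server_load > HIGH_LOAD_THRESHOLD:
        # High load: dominant frequency and pitch only, yielding a more generic voice.
        keys = ("dominant_frequency", "pitch")
    else:
        # Low load: a wider range of characteristics for a closer imitation.
        keys = ("dominant_frequency", "pitch", "amplitude", "tone", "duration")
    features = {k: voice_profile[k] for k in keys if k in voice_profile}
    return tts(translated_text, features)
```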

In some embodiments, after receiving the simulated first audio transmission, client device A presents the translated text representation on a display and outputs the simulated first audio transmission for user A at client device A. For example, the simulated first audio transmission in the target language is played at client device A in place of the original first audio transmission in the source language. In some embodiments, playback of the corresponding segment of the video transmission at client device A is delayed such that the video transmission received from the second client device is synchronized with playback of the simulated first audio transmission at client device A.

Fig. 4 is an example of processing at client device B when sending an audio transmission to client device A.

In some embodiments, when user B speaks at client device B (e.g., as shown in Fig. 2B), client device B detects a continuous speech input by user B at client device B and marks a first start time of the continuous speech input (e.g., start time st1 in Fig. 4) as the beginning of the first audio segment detected at client device B. When client device B detects a first predefined interruption in the continuous speech input (e.g., interruption bk1 in Fig. 4), client device B marks a first end time of the first predefined interruption bk1 (e.g., end time et1 in Fig. 4) as the end of the first audio segment. In some embodiments, a continuous speech input is defined as a continuous speech input stream that includes only brief interruptions shorter than a predefined speech input termination time threshold. When no speech input is detected for longer than the speech input termination time threshold, the continuous speech input is considered terminated. The speech input termination time threshold is longer than the predefined time threshold for recognizing a break in the continuous speech input, and the time threshold for detecting a break in the continuous speech input is longer than the estimated natural pause between words in a sentence or between two clauses of a sentence.
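These timing rules can be sketched as a small segmenter over voice-activity frames. The two threshold values below are illustrative assumptions that merely satisfy the stated ordering (termination threshold > break threshold > natural pause); they are not values taken from the disclosure.

```python
# Illustrative sketch of the segmentation rules: a break in continuous speech is
# silence longer than BREAK_THRESHOLD, and the continuous input terminates after
# silence longer than TERMINATION_THRESHOLD. Input is (timestamp_seconds, voiced)
# frames from a voice-activity detector.

BREAK_THRESHOLD = 0.5          # longer than a natural pause between words/clauses
TERMINATION_THRESHOLD = 2.0    # longer than the break threshold

def segment_speech(frames):
    segments = []
    seg_start = None           # start time of the current segment (st1, st2, ...)
    last_voice = None          # time of the most recent voiced frame
    for t, voiced in frames:
        if voiced:
            if seg_start is None:
                seg_start = t
            last_voice = t
            continue
        if last_voice is None:
            continue                                   # leading silence before any speech
        silence = t - last_voice
        if seg_start is not None and silence >= BREAK_THRESHOLD:
            segments.append((seg_start, last_voice))   # close the segment (et1, et2, ...)
            seg_start = None
        if silence >= TERMINATION_THRESHOLD:
            break                                      # continuous speech input has terminated
    if seg_start is not None:
        segments.append((seg_start, last_voice))
    return segments
```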

In some embodiments, after detecting the first predefined break bk1 in the continuous speech input, client device B converts the first audio segment into a first audio packet and sends the first audio packet to the server (shown as 412). The server then performs speech-to-text recognition on the first audio packet and translates the first audio segment from the source language to the target language (shown as 414). The server sends the translation of the first audio segment to client device A for presentation of the translation at client device A (shown as 416 and 418). In some embodiments, the audio packets are segments of an audio input stream that are encoded and compressed according to a predefined format, such as a Roshal ARchive (RAR) file.

In some embodiments, client device B continues to detect the continuous speech input by the user located at client device B while the first audio packet is generated and sent (at step 412). At least a portion of the continuous speech input detected while generating and transmitting the first audio packet is included in the first audio transmission as a second portion of the first audio transmission. For example, while continuing to detect the continuous speech input, client device B detects a second predefined break in the continuous speech input at client device B (e.g., break bk2 at the end of the second audio segment, segment 2, in Fig. 4). Client device B marks the end time of the first predefined interruption bk1 as a second start time of the second audio segment (e.g., start time st2 of segment 2 in Fig. 4), and marks the second end time of the second predefined interruption (e.g., end time et2 of segment 2 in Fig. 4) as the end of the second audio segment detected at client device B. Client device B generates a second audio packet to include the second audio segment and transmits the second audio packet to client device A (e.g., as shown at 422 through 428 in Fig. 4).

The above process continues as long as termination of the continuous speech input has not been detected, and further audio segments are detected in the continuous speech input, each tagged with a respective start timestamp (and optionally a respective end timestamp), converted into a respective audio packet, and sequentially transmitted to client device A according to the respective start timestamps of the audio segments. Thus, two or more audio segments, including the first audio segment and the second audio segment, are translated from the source language of the first audio transmission to the target language determined for client device A, for presentation at client device A. For example, the first audio transmission includes one or more sentences received in separate audio packets that each arrive at the server with different headers and timestamps; each sentence is translated separately from the source language of the first audio transmission into the target language determined for client device A, and the translation is presented at client device A.
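A sketch of the packaging and ordered transmission follows. The compression call and packet layout are placeholders; the disclosure names only a RAR-style archive as one example of a packet format, so the zlib compression here is purely illustrative.

```python
# Illustrative sketch: wrap each detected segment in a packet carrying its start
# (and optionally end) timestamp, then send the packets in start-time order.
import zlib

def build_and_send_packets(segments, audio_for, send):
    # segments: list of (start, end) times; audio_for(start, end) returns the raw
    # audio bytes for that span; send(packet) transmits one packet toward
    # client device A (directly or through the server).
    packets = [
        {"start": start, "end": end, "payload": zlib.compress(audio_for(start, end))}
        for start, end in segments
    ]
    for packet in sorted(packets, key=lambda p: p["start"]):
        send(packet)
```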

In some embodiments, while the continuous speech input is captured at client device B, client device B continuously captures video using a camera at client device B and tags the continuously captured video with the respective start timestamps (and optionally the respective end timestamps) of the two or more audio segments, wherein client device A (or the server) uses the respective start timestamps (and optionally the respective end timestamps) to synchronize the presentation of the video and the respective translations of the two or more audio segments at client device A.
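The synchronization itself can be sketched as a simple timestamp join; the field names used here are assumptions for illustration.

```python
# Illustrative sketch: pair each video frame with the translation of the audio
# segment whose time range covers the frame's timestamp, so that playback of the
# frame can be delayed until that translation is available.

def align_video_with_translations(video_frames, translations):
    # video_frames: list of (timestamp, frame); translations: list of dicts with
    # "start", "end", and "text" keys, one per translated audio segment.
    aligned = []
    for ts, frame in video_frames:
        matching = next((tr["text"] for tr in translations
                         if tr["start"] <= ts <= tr["end"]), None)
        aligned.append((ts, frame, matching))
    return aligned
```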

Fig. 5 is a timing diagram of example server-side processing during an audio and/or video communication session over a network between client device A and client device B. In some embodiments, the server 204 includes an audio/video server 502, a speech recognition server 504, and a translation server 506. In some embodiments, the servers 502, 504, and 506 are replaced by sub-modules of the server 204 that provide the described functionality.

In some embodiments, during an audio and/or video communication session, the audio/video server receives a first audio/video transmission in the source language spoken by user B from client device B (shown as 511) and sends the first audio transmission to the speech recognition server (shown as 513). The speech recognition server recognizes the first audio transmission and generates a text representation in the source language (shown as 515), using a speech recognition library or language model stored at the speech recognition server, and sends the text representation in the source language to the translation server (shown as 517) in preparation for translating the text representation from the source language to the target language that has been determined for client device A. The translation server then sends a target language request to the audio/video server (shown as 519) to determine whether translation of the transmission is required and, if so, into what target language the text representation should be translated (e.g., determining whether the source language is the same as the target language or the default language).

The audio/video server obtains the user language attributes from client device A and recommends the target language as the current language used at client device A (shown as 521). In some embodiments, the audio/video server receives facial features of the current user at client device A and the current geographic location of client device A, and determines a relationship between the facial features of the current user and the current geographic location of client device A (e.g., whether the facial features indicate a race or nationality sufficiently related (e.g., according to predefined criteria) to the current geographic location of client device A). When it is determined that the relationship meets a predefined criterion, the audio/video server recommends the target language. For example, if the facial features and the geographic location are both related to the same language, then that language is recommended as the target language. In some embodiments, the audio/video server receives an audio message that has been received locally at client device A (e.g., a verbal instruction from the user at client device A, or a voice input received from user A as part of the audio/video communication session) and analyzes the linguistic characteristics of the audio message. The audio/video server may then recommend a target language for use by the translation server based on the analysis of the linguistic characteristics of the audio message.

The audio/video server sends the recommended target language to the translation server (shown as 523). The translation server then translates the first audio transmission from the source language to the target language recommended by the audio/video server and sends the translation of the first audio transmission to client device A to present the translation results at client device A (e.g., providing a text representation and an aural representation of the translation at client device A).
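The pipeline of Fig. 5 can be summarized in a short sketch. Each stage is a caller-supplied callable, so what the example shows is only the control flow, not any particular speech recognition or machine translation service.

```python
# Illustrative sketch of the Fig. 5 pipeline: speech recognition, target-language
# lookup for the receiving client, optional translation, and delivery.

def handle_first_audio_transmission(audio, source_language, receiver_id,
                                    transcribe, lookup_target, translate, deliver):
    text = transcribe(audio, source_language)        # speech recognition server (515)
    target = lookup_target(receiver_id)              # audio/video server recommendation (519/521)
    if target == source_language:
        deliver(receiver_id, audio, text, None)      # no translation required
        return
    translation = translate(text, source_language, target)   # translation server (523)
    deliver(receiver_id, audio, text, translation)   # presented at client device A
```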

In some embodiments, the speech recognition server identifies the sound characteristics of the speech in the first audio transmission (shown as 531) and generates, from the sound characteristics of the speech in the first audio transmission and the translation of the first audio transmission, a simulated first audio transmission (shown as 533). The sound characteristics may include a voiceprint or a set of predefined characteristics, such as the frequency, tone, pitch, duration, amplitude, etc., of a person's voice. After the simulated first audio transmission is generated by the speech recognition server, the speech recognition server sends the simulated first audio transmission to client device A (shown as 535) for presentation of the translation carried by the simulated first audio transmission.

Figs. 6A-6G illustrate a flow diagram of a method 600 of providing an audio communication session between two or more client devices. In some embodiments, method 600 is performed by the first client device in conjunction with the server or independently of the server. For example, in some embodiments, method 600 is performed by the client device 104-1 (e.g., the client-side module 102-1) in conjunction with the server system 108 (Figs. 1-2) or a server system component (e.g., the server-side module 106, Figs. 1-2). In some embodiments, method 600 is governed by instructions stored in a non-transitory computer-readable storage medium and executed by one or more processors of a client and server system. Optional operations are indicated by dashed lines (e.g., boxes with dashed borders).

In the method 600 of processing audio communications over a network, a first client device has established an audio and/or video communication session with a second client device over the network (e.g., a user of the first device and a user of the second device have established a real-time video conference over the Internet through a server of an online teleconferencing service provider). During the audio and/or video communication session: the first client device receives (602), from the second client device (e.g., via a server of a video conference service), a first audio transmission, wherein the first audio transmission is provided by the second client device in a source language that is different from a default language associated with the first client device (e.g., a default language automatically selected by the server for the first client device, or a preferred language that the user has specified in a language setting of the first client device prior to the start of the video communication session). The first client device obtains (604) one or more current user language attributes (e.g., facial features of the user, geographic location information, locally received audio messages, etc.) of the first client device, wherein the one or more current user language attributes indicate a current language for the audio and/or video communication session at the first client device. When it is determined that the one or more current user language attributes recommend a target language currently used for the audio and/or video communication session at the first client device, and it is determined that the target language is different from the default language associated with the first client device (606): the first client device obtains (608) a translation of the first audio transmission from the source language to the target language; and the first client device presents (610) the translation of the first audio transmission in the target language to a user at the first client device. For example, in some embodiments, the target language is recommended by the first client device. In some embodiments, the target language is recommended by a server. In some embodiments, the first client device determines whether the target language is the same as the default language associated with the first client device. In some embodiments, the server makes the determination as to whether the target language is the same as the default language associated with the first client device. In some embodiments, the translation of the first audio transmission in the target language is presented as a text output at the first client device. In some embodiments, the translation of the first audio transmission is provided as an audio output at the first client device. In some embodiments, both the translated text representation and the audible representation are provided at the first client device (e.g., synchronized with a portion of the video corresponding to the first audio transmission).

In some embodiments, obtaining the one or more current user language attributes of the first client device (e.g., step 604) includes (612) obtaining facial features of the user at the first client device (e.g., obtaining ethnic features (e.g., eye color, facial structure, hair color, etc.) that indicate the ethnicity or nationality of the user, or obtaining facial features of the user used to determine whether the user currently using the first client device is different from the user who set the default language of the first client device), and obtaining geographic location information of the first client device (e.g., the current location of the first client device, a historical location over a preset time period, or a pre-stored location of the first client device). The facial features of the user at the first client device and the geographic location information of the first client device are combined (614) to recommend the target language as the current language used by the first client device, rather than the default language associated with the first client device. For example, in some embodiments, the first client device captures facial features of a current user speaking and/or listening at the first client device and obtains current geographic location information of the first client device. In some embodiments, the first client device determines that the current user is likely Caucasian based on the facial features and determines that the current location of the first client device is in North America. Based on the combination of the ethnicity and the geographic location information, the first client device infers that the current language used at the first client device is likely English. In some embodiments, if the default language of the first client device has been specified by a previous user input in the settings interface, the first client device requires that at least one of the currently collected facial features and/or geographic location information indicate that the current user is different from the user who specified the default language of the first client device. In some embodiments, the determination of the target language is performed by a server of the communication session after the first client device collects the facial features and the geographic location information and sends the collected information to the server. In some embodiments, the first client device determines the target language locally without transmitting the facial features and geographic location information to the server, which helps to protect user privacy and reduce server load. In some embodiments, the target language is determined based on a data model trained on a server, and the determined target language is then stored at the first client device. In some embodiments, the first client device presents a request for confirmation to the user at the first client device before the first client device determines that the target language will replace the default language as the approved current language for use at the first client device. In some embodiments, the translation from the source language to the target language is provided to the user at the first client device only after the first client device receives confirmation that the recommended target language is a correct recommendation.

In some embodiments, obtaining the one or more current user language attributes of the first client device (e.g., step 604) includes (616) obtaining audio input received locally at the first client device during the audio and/or video communication session. The audio input received locally at the first client device is linguistically analyzed (618) (e.g., using a language model or a speech model to determine the spoken language) to recommend the target language as the current language for use at the first client device. For example, in some embodiments, the first client device or server identifies the language of the audio input as English and determines that the current language used at the first client device is English, and the first client device or server recommends English as the target language of the first client device, rather than mistakenly treating the default language currently associated with the first client device as the current language used at the first client device.

In some embodiments, the first client device obtains (622) sound characteristics (e.g., a voiceprint or a set of predefined characteristics, such as the frequency, tone, pitch, duration, amplitude, etc., of a person's voice) of the speech in the first audio transmission; and the first client device generates (624) a simulated first audio transmission based on the sound characteristics of the speech in the first audio transmission, the simulated first audio transmission including a translation of the first audio transmission spoken in the target language in accordance with the sound characteristics of the speech of the first audio transmission. For example, in some embodiments, the simulated first audio transmission is generated using a generic voice of a male, a female, or a child, based on the sound characteristics obtained from the first audio transmission indicating whether the original first audio transmission was spoken by a male, a female, or a child. In some embodiments, the simulated first audio transmission closely mimics the speech of the original first audio transmission. In some embodiments, a system (e.g., a server) automatically switches between using generic speech or specially simulated speech to speak the translation, depending on the server load (e.g., processing power, memory, and network bandwidth) and the rate at which audio transmissions are received at the first client device. For example, when the server load is above a predefined threshold, the simulated first audio transmission is provided in speech generated from a small subset of the sound characteristics of the original first audio transmission; and when the server load is below the predefined threshold, the simulated first audio transmission is provided in speech generated from a larger subset of the sound characteristics of the original first audio transmission.

In some embodiments, presenting the translation of the first audio transmission in the target language to the user at the first client device (e.g., step 610) comprises: presenting (626) the translated text representation of the first audio transmission in the target language to the user at the first client device; and presenting (628) the simulated first audio transmission generated from the sound characteristics of the speech in the first audio transmission (e.g., playing the simulated first audio transmission in the target language at the first client device in place of the original first audio transmission in the source language). In some embodiments, playback of the corresponding segment of the video transmission at the first client device is delayed such that the video transmission received from the second client device is synchronized with playback of the simulated first audio transmission at the first client device.

In some embodiments, during the audio and/or video communication session: the first client device detects (632) a continuous speech input (e.g., the continuous speech input is defined as a continuous speech input stream that includes only brief interruptions shorter than a predefined speech input termination time threshold; when no speech input is detected for longer than the speech input termination time threshold, the continuous speech input is considered terminated; the speech input termination time threshold is longer than a predefined time threshold for recognizing a break in the continuous speech input, and the time threshold for detecting a break in the continuous speech input is longer than the estimated natural pause between words of a sentence or between two clauses of a sentence). The first client device marks (634) a start time of the continuous speech input as the beginning of the first audio segment detected at the first client device. The first client device detects (636) a first predefined break in the continuous speech input at the first client device (e.g., detects that there is not a sufficient amount of speech input in the continuous audio input stream at the first client device for at least a threshold amount of time). In response to detecting the first predefined interruption in the continuous speech input, the first client device marks a start time of the first predefined interruption as the end of the first audio segment detected at the first client device, wherein the first audio segment is included in a second audio transmission sent to the second client device.

In some embodiments, after detecting the first predefined break in the continuous speech input, the first client device generates (642) a first audio packet comprising the first audio segment. The first client device sends (644) the first audio packet to the second client device as a first portion of the second audio transmission. While generating the first audio packet and transmitting the first audio packet: the first client device continues (646) to detect the continuous speech input by the user located at the first client device, wherein at least a portion of the continuous speech input detected while generating and transmitting the first audio packet is included in the second audio transmission as a second portion of the second audio transmission. For example, while continuing to detect the continuous speech input, the first client device detects a second predefined interruption in the continuous speech input at the first client device. The first client device marks the end time of the first predefined break as the start time of the second audio segment and marks the start time of the second predefined break as the end of the second audio segment detected at the first client device. The first client device generates a second audio packet to include the second audio segment and transmits the second audio packet to the second client device. The above process continues as long as termination of the continuous speech input has not been detected, and further audio segments are detected, converted into audio packets, and sent to the second client device. In some embodiments, two or more audio segments, including the first audio segment and the second audio segment, are translated into the source language of the first audio transmission for presentation at the second client device. For example, the second audio transmission includes one or more sentences received in separate audio packets that each arrive at the server with different headers and timestamps; each sentence is translated separately into the source language of the first audio transmission, and the translation is presented at the second client device.

In some embodiments, during the audio and/or video communication session: the first client device identifies (648) two or more audio segments in a continuous speech input (e.g., a continuous speech input stream) at the first client device, each audio segment tagged with a respective start timestamp (and optionally a respective end timestamp); the first client device generates (650) a respective audio packet for each of the two or more audio segments (e.g., an audio packet is a segment of the audio input stream encoded and compressed according to a predefined format, such as a RAR file); and the first client device sequentially transmits the respective audio packets of the two or more audio segments to the second client device (e.g., through the server or directly) according to the respective start timestamps of the audio segments. In some embodiments, the audio packets are sent to a server responsible for translating the audio segments, rather than to the second client device. In some embodiments, the transmission of the audio packets (e.g., as separate and discrete files) is independent of the transmission of the audio that is continuously captured at the first client device (e.g., by continuous streaming).

In some embodiments, during the audio and/or video communication session: the first client device continuously captures (656) video using a camera at the first client device while the continuous speech input is captured at the first client device; and the first client device marks the continuously captured video with the respective start timestamps (and optionally respective end timestamps) of the two or more audio segments, wherein the second client device (or the server) uses the respective start timestamps (and optionally the respective end timestamps) to synchronize the presentation of the video and the respective translations of the two or more audio segments at the second client device.

It should be understood that the particular order in which the operations are described in Figs. 6A-6G is merely exemplary and is not intended to indicate that the order described is the only order in which the operations may be performed. One of ordinary skill in the art will recognize various ways to reorder the operations described herein. Additionally, it should be noted that details of other processes described herein with reference to other methods described herein are also applicable in an analogous manner to method 600 described above.

Figs. 7A-7F illustrate a flow diagram of a method 700 of providing an audio communication session between two or more client devices. In some embodiments, method 700 is performed by a server in conjunction with two or more client devices. For example, in some embodiments, method 700 is performed by the server 108 in conjunction with the client devices 104-1 and 104-2 or client device components (e.g., the client-side module 102, Figs. 1-2). In some embodiments, method 700 is governed by instructions stored in a non-transitory computer-readable storage medium and executed by one or more processors of a client and server system. Optional operations are indicated by dashed lines (e.g., boxes with dashed borders).

Through the server, the first client device has established an audio and/or video communication session with the second client device over the network (e.g., a user of the first device and a user of the second device have established a real-time video conference over the Internet through a server of an online teleconferencing service provider). During the audio and/or video communication session: the server receives (702) a first audio transmission from the second client device, wherein the first audio transmission is provided by the second client device in a source language different from a default language associated with the first client device (e.g., a default language automatically selected by the server for the first client device, or a preferred language that the user has specified in a language setting of the first client device prior to the start of the video communication session). The server obtains (704) (e.g., from the first client device, and/or optionally from another server) one or more current user language attributes of the first client device (e.g., facial features of the user at the first client device, geographic location information (e.g., current location and/or recent locations), an audio message received locally at the first client device, etc.), wherein the one or more current user language attributes indicate a current language for the audio and/or video communication session at the first client device. When it is determined that the one or more current user language attributes recommend a target language currently used for the audio and/or video communication session at the first client device, and it is determined that the target language is different from the default language associated with the first client device (706): the server obtains (708) a translation of the first audio transmission from the source language to the target language; and the server sends (710) the translation of the first audio transmission in the target language to the first client device, wherein the translation is presented to the user at the first client device. For example, in some embodiments, the target language is recommended by the first client device. In some embodiments, the target language is recommended by the server. In some embodiments, the first client device determines whether the target language is the same as the default language associated with the first client device. In some embodiments, the server makes the determination as to whether the target language is the same as the default language associated with the first client device. In some embodiments, the translation of the first audio transmission in the target language is presented as a text output at the first client device. In some embodiments, the translation of the first audio transmission is provided as an audio output at the first client device. In some embodiments, both the translated text representation and the audible representation are provided at the first client device (e.g., synchronized with a portion of the video corresponding to the first audio transmission, in a text mode, or in an audio mode).

In some embodiments, obtaining the one or more current user language attributes and recommending the target language currently used at the first client device for the audio and/or video communication session (e.g., step 704) further comprises: receiving (712), from the first client device, facial features of the current user and the current geographic location of the first client device; determining (714) a relationship between the facial features of the current user and the current geographic location of the first client device (e.g., whether the facial features indicate a race or nationality sufficiently related (e.g., according to predefined criteria) to the current geographic location of the first client device); and recommending (716) the target language when the relationship is determined to meet the predefined criteria (e.g., in some embodiments, if the facial features and the geographic location are both related to the same language, that language is recommended as the target language).

In some embodiments, obtaining the one or more current user language attributes and recommending the target language currently used at the first client device for the audio and/or video communication session (e.g., step 704) further comprises: receiving (718), from the first client device, an audio message that has been received locally at the first client device; analyzing (720) a linguistic characteristic of the audio message received locally at the first client device; and recommending (722) the target language currently used for the audio and/or video communication session at the first client device based on the analysis of the linguistic characteristic of the audio message.

In some embodiments, the server obtains (732) sound characteristics (e.g., a voiceprint or a set of predefined characteristics, such as the frequency, tone, pitch, duration, amplitude, etc., of a person's voice) of the speech in the first audio transmission; and the server generates (734) a simulated first audio transmission based on the sound characteristics of the speech in the first audio transmission, the simulated first audio transmission including a translation of the first audio transmission spoken in the target language in accordance with the sound characteristics of the speech of the first audio transmission. In some embodiments, sending the translation of the first audio transmission in the target language to the first client device (e.g., step 710) comprises: sending (736) the translated text representation of the first audio transmission in the target language to the first client device; and sending (738) the simulated first audio transmission generated from the sound characteristics of the speech in the first audio transmission to the first client device (e.g., sending the simulated first audio transmission in the target language to the first client device in place of the original first audio transmission in the source language). In some embodiments, the transmission of the corresponding segment of the video transmission to the first client device is delayed such that the video transmission to the first client device is synchronized with the transmission of the simulated first audio transmission to the first client device.

In some embodiments, receiving the first audio transmission from the second client device (e.g., step 702) further comprises: receiving (742), from the second client device, two or more audio packets of the first audio transmission, wherein the two or more audio packets have been sequentially sent from the second client device according to respective timestamps of the two or more audio packets, and wherein each respective timestamp indicates a start time of a corresponding audio segment identified in the first audio transmission. In some embodiments, the server may receive the two or more audio packets out of order, and the server rearranges the audio packets according to the timestamps. In some embodiments, the server does not order the received packets based on their respective timestamps; instead, after translations for at least two of the audio segments have been obtained, the server orders the translations of the audio segments in the two or more audio packets based on their respective timestamps. In some embodiments, obtaining the translation of the first audio transmission from the source language to the target language and sending the translation of the first audio transmission in the target language to the first client device (e.g., steps 708 and 710) further comprises: sequentially obtaining (744) respective translations of the two or more audio packets from the source language to the target language based on the respective timestamps of the two or more audio packets; and sending (746) a first translation to the first client device after the first translation of at least one of the two or more audio packets is completed and before the translation of at least another of the two or more audio packets is completed.
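A sketch of this ordering behaviour follows, assuming a per-packet translation callable (hypothetical names throughout). The point it shows is that packets are processed in timestamp order even when they arrive out of order, while each finished translation is forwarded without waiting for the rest.

```python
# Illustrative sketch: restore timestamp order over possibly out-of-order packets,
# translate each one, and forward each translation as soon as it is ready.
import heapq

def translate_and_forward(packets, translate_packet, send_translation):
    # packets: dicts with at least a "start" timestamp; translate_packet(p) returns
    # the translation of that packet; send_translation forwards it to client device A.
    ordered = [(p["start"], i, p) for i, p in enumerate(packets)]
    heapq.heapify(ordered)
    while ordered:
        _, _, packet = heapq.heappop(ordered)
        send_translation(packet["start"], translate_packet(packet))
```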

In some embodiments, the server receives (748) a first video transmission from the second client device concurrently with receiving the first audio transmission, wherein the first video transmission is tagged with the same set of timestamps as the two or more audio packets; and the server sends (750) the first video transmission and the respective translations of the two or more audio packets of the first audio transmission, having the same set of timestamps, to the first client device, such that the first client device synchronously presents the first video transmission and the respective translations of the two or more audio packets of the first audio transmission according to the same set of timestamps. In some embodiments, the server receives a continuous audio stream and a continuous video stream from the second client device, for example, over a dedicated network connection for the audio and/or video communication session. The server extracts audio segments from the continuous audio stream one by one (e.g., based on detection of a predefined break in the continuous speech input embodied in the continuous audio stream). For example, the server generates an audio packet for each recognized audio segment and sends the audio packet (e.g., as opposed to a continuous audio stream) to the translation server or server-side translation module when the end of the audio segment is detected, while the server continues to receive the audio and video streams. In some embodiments, the server transmits the video stream to the first client device as a continuous video stream and transmits the translations of the audio packets to the first client device as audio and text data packets, wherein the first client device synchronizes the presentation of the video and the translations of the audio packets. In some embodiments, the server inserts a translation of an audio packet at an appropriate location in the video stream and sends the video stream with the embedded translation to the first client device.

It should be understood that the particular order in which the operations are described in Figs. 7A-7F is merely exemplary and is not intended to indicate that the order described is the only order in which the operations may be performed. One of ordinary skill in the art will recognize various ways to reorder the operations described herein. Additionally, it should be noted that details of other processes described herein with reference to other methods described herein are also applicable in an analogous manner to method 700 described above.

Fig. 8 is a block diagram illustrating a representative client device 104 associated with a user, in accordance with some embodiments. The client device 104 typically includes one or more processing units (CPUs) 802, one or more network interfaces 804, memory 806, and one or more communication buses 808 for interconnecting these components (sometimes referred to as a chipset). The client device 104 also includes a user interface 810. User interface 810 includes one or more output devices 812, including one or more speakers and/or one or more visual displays, that enable presentation of media content. The user interface 810 also includes one or more input devices 814, including user interface components that facilitate user input, such as a keyboard, a mouse, a voice command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture-capture camera, or other input buttons or controls. In addition, some client devices 104 use a microphone and voice recognition, or a camera and gesture recognition, to supplement or replace the keyboard. In some embodiments, the client device 104 also includes sensors that provide contextual information about the current state of the client device 104 or environmental conditions associated with the client device 104. Sensors include, but are not limited to, one or more microphones, one or more cameras, an ambient light sensor, one or more accelerometers, one or more gyroscopes, a GPS positioning system, a Bluetooth or BLE system, a temperature sensor, one or more motion sensors, one or more biosensors (e.g., a skin resistance sensor, a pulse oximeter, etc.), and other sensors. The memory 806 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and optionally, non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid-state storage devices. Memory 806 optionally includes one or more storage devices located remotely from the one or more processing units 802. Memory 806, or alternatively the non-volatile memory within memory 806, includes a non-transitory computer-readable storage medium. In some embodiments, memory 806 or the non-transitory computer-readable storage medium of memory 806 stores the following programs, modules, and data structures, or a subset or superset thereof:

● an operating system 816 including programs for handling various basic system services and for performing hardware dependent tasks;

● network communication module 818 for connecting client device 104 to other computing devices (e.g., server system 108) connected to one or more networks 110 via one or more network interfaces 804 (wired or wireless);

● a rendering module 820 for enabling rendering of information (e.g., user interfaces for applications or social networking platforms, gadgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at the client device 104 via one or more output devices 812 (e.g., displays, speakers, etc.) associated with the user interface 810;

● input processing module 822 for detecting one or more user inputs or interactions from one or more input devices 814 and interpreting the detected inputs or interactions;

● one or more applications executed by the client device 104 (e.g., games, application markets, payment platforms, and/or other network- or non-network-based applications);

● a client-side module 102, which provides client-side data processing and functionality for real-time audio/video communications, including but not limited to:

○ a data transfer module 826 for transferring audio/video/text data to and from the server and other client devices;

○ a translation module 828 for translating audio or text from one language to another;

○ a speech recognition module 830 for performing speech-to-text conversion on the speech audio input;

○ a rendering module 832 for rendering the original audio/video and/or translation in audio and/or textual form;

○ a determining module 834 for determining a target language and determining whether the target language of the client device is the same as the default language set for the client device;

○ an obtaining module 836 for obtaining the current language attributes of the client device; and

○ other modules 838 for performing other functions set forth herein.

Each of the above identified elements may be stored in one or more of the previously mentioned memory devices and correspond to a set of instructions for performing the functions described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments. In some embodiments, memory 806 optionally stores a subset of the modules and data structures identified above. Further, memory 806 optionally stores additional modules and data structures not described above.

Fig. 9 is a block diagram illustrating a server system 108 according to some embodiments. The server system 108 typically includes one or more processing units (CPUs) 902, one or more network interfaces 904 (e.g., including I/O interfaces to one or more clients 114 and I/O interfaces to one or more external services), memory 906, and one or more communication buses 908 for interconnecting these components (sometimes referred to as a chipset). Server 108 also optionally includes a user interface 910. The user interface 910 includes one or more output devices 912 that enable presentation of information and one or more input devices 914 that enable user input. Memory 906 comprises high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and optionally, non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid-state storage devices. The memory 906 optionally includes one or more storage devices located remotely from the one or more processing units 902. Memory 906, or alternatively, non-volatile memory within memory 906, includes non-transitory computer-readable storage media. In some embodiments, memory 906 or a non-transitory computer-readable storage medium of memory 906 stores the following programs, modules, and data structures, or a subset or superset thereof:

● operating system 916 including programs for handling various basic system services and for performing hardware dependent tasks;

● a network communication module 918 for connecting the server system 108 to other computing devices (e.g., client devices 104 and external services) connected to one or more networks 110 via one or more network interfaces 904 (wired or wireless);

● a presentation module 920 for enabling presentation of information;

● input processing module 922 for detecting one or more user inputs or interactions from the one or more input devices 914 and interpreting the detected inputs or interactions;

● one or more server applications 924 for managing server operations;

● server-side module 106 that provides server-side data processing and functionality for facilitating audio/video communications between client devices, including but not limited to:

○ a data transfer module 926 for transferring audio/video/text data to and from the server and other client devices;

○ a translation module 928 for translating audio or text from one language to another;

○ a speech recognition module 930 for performing speech-to-text conversion on the speech audio input;

○ an obtaining module 932 for obtaining the current language attributes of the client device;

○ a determination module 934 for determining a target language and determining whether the target language of the client device is the same as a default language set for the client device;

○ an audio/video processing module 936 for processing input streams for audio processing and video processing, respectively; and

○ other modules 938 for performing other functions set forth herein.

Each of the above identified elements may be stored in one or more of the previously mentioned memory devices and correspond to a set of instructions for performing the functions described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments. In some embodiments, memory 906 optionally stores a subset of the modules and data structures identified above. Further, memory 906 optionally stores additional modules and data structures not described above.

In some embodiments, at least some of the functions of server system 108 are performed by client device 104, and corresponding sub-modules of these functions may be located within client device 104 rather than server system 108. In some embodiments, at least some of the functions of the client device 104 are performed by the server system 108, and corresponding sub-modules of these functions may be located within the server system 108 rather than the client device 104. The client device 104 and the server system 108, shown in Figs. 8 and 9, respectively, are merely illustrative, and in various embodiments, different configurations of modules for implementing the functionality described herein are possible.

While specific embodiments are described above, it should be understood that the application is not intended to be limited to these specific embodiments. On the contrary, the application includes alternatives, modifications, and equivalents as may be included within the spirit and scope of the appended claims. Numerous specific details are set forth in order to provide a thorough understanding of the subject matter presented herein. It will be apparent, however, to one of ordinary skill in the art that the subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
