Question

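I am currently working on a deep learning project for speech recognition.
I need to augment my current data with time-shifted or stretched versions of the sound files, but the shape changes during augmentation.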

  y, sr = librosa.load(os.path.join(train_data_path, label, fname))
  librosa.output.write_wav('./input/train_test2/' + label + '/10000' + fname, y, sr)

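Even though I did not change anything, the shape changed.
Say the original shape was (99, 81, 1); after rewriting the file it becomes (77, 81, 1) or something else.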
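
One likely cause, offered as an assumption rather than a confirmed diagnosis: `librosa.load` resamples to 22050 Hz by default, so the array it returns no longer matches the file on disk unless `sr=None` is passed. The sketch below loads a file at its native rate and writes it back. The function name `copy_at_native_rate` is hypothetical, and `soundfile.write` stands in for `librosa.output.write_wav`, which was removed in librosa 0.8.

```python
import os


def copy_at_native_rate(src_path, dst_path):
    """Load a wav at its own sample rate and write it back unchanged."""
    # Imported lazily so the sketch can be read without the packages installed.
    import librosa
    import soundfile as sf

    # sr=None keeps the file's native rate; the default (sr=22050) resamples.
    y, sr = librosa.load(src_path, sr=None)

    # Create the destination folder if it does not exist yet.
    os.makedirs(os.path.dirname(dst_path) or ".", exist_ok=True)

    # Write with the same rate that the samples actually have.
    sf.write(dst_path, y, sr)
    return y.shape, sr
```

If the round trip is done this way, the sample count and the rate stored in the header should stay consistent with each other, which is what the later resampling step relies on.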

The problem shows up when I use Keras for classification:

inp = Input(shape=input_shape)
norm_inp = BatchNormalization()(inp)
img_1 = Convolution2D(8, kernel_size=2, activation=activations.relu)(norm_inp)
img_1 = Convolution2D(8, kernel_size=2, activation=activations.relu)(img_1)
img_1 = MaxPooling2D(pool_size=(2, 2))(img_1)
img_1 = Dropout(rate=0.2)(img_1)

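Keras does not accept inputs whose shape differs from input_shape. I am not even sure whether the original shape can be kept after modifying the wav file.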

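Is it possible to keep the original shape?
If not, can the modified file be converted back to the shape of the original file?
What other solutions would you suggest?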
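
One common way to get back to a fixed shape, sketched here under the assumption that every clip should be exactly one second at 16 kHz (16000 samples): pad short clips with silence and truncate long ones before computing the spectrogram. The helper below is a plain-Python illustration, not code from the project. librosa provides `librosa.util.fix_length` for the same purpose.

```python
def fix_length(samples, target_len, pad_value=0.0):
    """Return a copy of `samples` with exactly `target_len` items."""
    samples = list(samples)
    if len(samples) >= target_len:
        # Too long: keep only the first target_len samples.
        return samples[:target_len]
    # Too short: append padding (silence) at the end.
    return samples + [pad_value] * (target_len - len(samples))


# Hypothetical clips: one shorter and one longer than 1 s at 16 kHz.
short_clip = [0.1] * 12000
long_clip = [0.1] * 20000
print(len(fix_length(short_clip, 16000)), len(fix_length(long_clip, 16000)))
```

With a fixed number of samples and a fixed sample rate, the spectrogram has the same number of frames for every clip, so a single input_shape works.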

=========================================== After I compute the log spectrogram

def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                    fs=sample_rate,
                                    window='hann',
                                    nperseg=nperseg,
                                    noverlap=noverlap,
                                    detrend=False)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)

The shape of np.log(spec.T.astype(np.float32) + eps) is different.

============================================================= Original file

sample_rate, samples = wavfile.read('./input/train/audio/eight/012c8314_nohash_1.wav')
# sample_rate_test is not defined in this snippet (presumably the rate read from the rewritten file)
print(sample_rate, sample_rate_test)
new_sample_rate = 8000
resampled = signal.resample(samples, int(new_sample_rate / sample_rate * samples.shape[0]))
# resampled2 is likewise not defined here (presumably the rewritten file after the same resampling)
print(resampled2.shape)
_, _, specgram = log_specgram(resampled, sample_rate=new_sample_rate)
print("specgramshape->", specgram.shape)
S = librosa.feature.melspectrogram(y=samples, sr=sample_rate, n_mels=128, fmax=8000)
print("S->", S.shape)

librosa.display.specshow(librosa.power_to_db(S, ref=np.max), y_axis='mel', fmax=8000, x_axis='time')


16000 22050
(5804,)
specgramshape-> (99, 81)
S-> (128, 32)

===============================================================

After running:

y = librosa.resample(y, sr, 16000)
librosa.output.write_wav('./input/train_test/' + label + '/10000' + fname, y, sr)

(16000,) (16000,)
(5804,)
specgramshape-> (71, 81)
S-> (128, 32)
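
The two frame counts above fit one reading, offered as an assumption: the rewritten file holds 16000 samples, but its header carries the 22050 Hz rate that `librosa.load` returned in `sr`, so resampling to 8000 Hz gives int(8000 / 22050 * 16000) = 5804 samples instead of 8000. With no padding, `scipy.signal.spectrogram` yields (n - noverlap) // (nperseg - noverlap) frames. The helper names below are hypothetical; the arithmetic mirrors `log_specgram` and the resampling line in the post.

```python
def resampled_length(n_samples, file_rate, new_rate=8000):
    """Sample count after the signal.resample call used in the post."""
    return int(new_rate / file_rate * n_samples)


def expected_frames(n_samples, sample_rate, window_size=20, step_size=10):
    """Frame count of log_specgram for a signal of n_samples samples."""
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    # Each new frame advances by (nperseg - noverlap) samples.
    return (n_samples - noverlap) // (nperseg - noverlap)


# Original file: 16000 samples, header rate 16000 Hz.
n_orig = resampled_length(16000, 16000)
# Rewritten file (assumed): 16000 samples, header rate 22050 Hz.
n_new = resampled_length(16000, 22050)

print(n_orig, expected_frames(n_orig, 8000))
print(n_new, expected_frames(n_new, 8000))
```

Under that assumption the mismatch comes from the sample rate written to the file rather than from the audio itself, so writing the resampled data with 16000 as the rate would be the first thing to try.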

=====================================================================

librosa: the shape of a sound wav file changes from the original shape to a seemingly random shape

0 answers: