Time |
Nickname |
Message |
00:03
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
00:06
🔗
|
|
dashcloud has joined #archiveteam-bs |
00:31
🔗
|
|
_Crocatow has quit IRC (Read error: Connection reset by peer) |
00:31
🔗
|
|
_Crocatow has joined #archiveteam-bs |
00:41
🔗
|
|
JesseW has joined #archiveteam-bs |
00:56
🔗
|
|
fmope has quit IRC (Remote host closed the connection) |
00:56
🔗
|
|
fmope has joined #archiveteam-bs |
00:59
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
02:12
🔗
|
|
ranma has joined #archiveteam-bs |
02:15
🔗
|
ranma |
the IA makes a LITTLE bit of an effort to backup zips and exes etc on websites, right? |
02:15
🔗
|
ranma |
or not usually |
02:15
🔗
|
ranma |
? |
02:17
🔗
|
JesseW |
I don't think IA's general crawls make a distinction between such files and any other ones. |
02:18
🔗
|
JesseW |
What's the URL of the fork on github? |
02:18
🔗
|
ranma |
https://github.com/OldSparkyMI/minishowcase |
02:18
🔗
|
|
Stiletto has joined #archiveteam-bs |
02:19
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
02:19
🔗
|
JesseW |
ranma: probably worth grabbing a copy of that yourself, if you haven't already |
02:19
🔗
|
ranma |
yep. and threw it on google drive |
02:20
🔗
|
JesseW |
I've just put the zip, https://github.com/OldSparkyMI/minishowcase/archive/master.zip into #archivebot, too |
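The zip URL JesseW submitted follows a visible pattern: the repository URL plus `/archive/master.zip`. A minimal sketch of deriving it, assuming the branch is named `master` as in the URL above; the helper name `github_archive_zip` is hypothetical and not part of any tool mentioned in the channel.

```python
from urllib.parse import urlparse


def github_archive_zip(repo_url: str, branch: str = "master") -> str:
    """Build the branch-snapshot zip URL for a GitHub repository page URL."""
    parts = urlparse(repo_url)
    # Expect a path of the form /<owner>/<repo>; ignore anything after that.
    owner, repo = parts.path.strip("/").split("/")[:2]
    return f"https://github.com/{owner}/{repo}/archive/{branch}.zip"


if __name__ == "__main__":
    print(github_archive_zip("https://github.com/OldSparkyMI/minishowcase"))
```

A snapshot zip like this holds only the files of one branch, not the commit history, so a `git clone` is probably still worth keeping alongside it.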
02:20
🔗
|
JesseW |
so the thing is pretty well saved at this point |
02:22
🔗
|
ranma |
i'm assuming thisamericanlife, radiolab, other npr podcasts are backed up? |
02:24
🔗
|
JesseW |
ranma: check with godane |
02:24
🔗
|
JesseW |
what shows up on Wayback? |
02:25
🔗
|
ranma |
checking |
02:26
🔗
|
godane |
i have this american life on my drive |
02:26
🔗
|
godane |
i have not uploaded it yet |
02:27
🔗
|
ranma |
https://web.archive.org/web/20160127230614/http://www.thisamericanlife.org/radio-archives/episode/549/amateur-hour |
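Wayback links like the one above embed the capture time as a 14-digit `YYYYMMDDhhmmss` stamp ahead of the original URL. A rough sketch of splitting one apart to compare capture dates; `parse_wayback_url` and `Snapshot` are illustrative names, and stamps carrying modifiers such as `id_` are not handled.

```python
from datetime import datetime
from typing import NamedTuple

PREFIX = "https://web.archive.org/web/"


class Snapshot(NamedTuple):
    captured: datetime   # capture time encoded in the snapshot URL
    original_url: str    # the URL that was archived


def parse_wayback_url(url: str) -> Snapshot:
    """Split a Wayback snapshot URL into capture time and original URL."""
    if not url.startswith(PREFIX):
        raise ValueError("not a Wayback snapshot URL")
    # Everything up to the first slash after the prefix is the timestamp.
    stamp, _, original = url[len(PREFIX):].partition("/")
    return Snapshot(datetime.strptime(stamp, "%Y%m%d%H%M%S"), original)


if __name__ == "__main__":
    snap = parse_wayback_url(
        "https://web.archive.org/web/20160127230614/"
        "http://www.thisamericanlife.org/radio-archives/episode/549/amateur-hour"
    )
    print(snap.captured.isoformat(), snap.original_url)
```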
02:27
🔗
|
ranma |
download link doesn't work |
02:28
🔗
|
ranma |
not butthurt, just more curious as to the archive bot-thingie's capabilities |
02:34
🔗
|
MrRadar |
Huh, it looks like they used to have direct download links but now redirect you to buy episodes from iTunes and Amazon |
02:34
🔗
|
MrRadar |
E.g. the download link for this episode works (though it was also archived through ArchiveBot and not the IA's normal crawler) https://web.archive.org/web/20140929042520/http://www.thisamericanlife.org/radio-archives/episode/536/the-secret-recordings-of-carmen-segarra |
02:36
🔗
|
MrRadar |
Hmm... it looks like episodes are only available for direct download for a short time before the downloads are paywalled |
02:37
🔗
|
MrRadar |
E.g. last week's episode has a direct download http://www.thisamericanlife.org/radio-archives/episode/585/in-defense-of-ignorance |
02:37
🔗
|
MrRadar |
But this one from January is pay to download http://www.thisamericanlife.org/radio-archives/episode/577/something-only-i-can-see |
02:38
🔗
|
MrRadar |
If it doesn't get crawled while it's a direct download, probably neither our bot nor the IA's can extract the audio from the stream |
02:38
🔗
|
godane |
there is an m3u8 file that you can use |
02:39
🔗
|
MrRadar |
It looks like youtube-dl can also scrape their streams |
02:39
🔗
|
MrRadar |
(This American Life at least) |
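The m3u8 file godane mentions is a plain-text playlist: lines starting with `#` are tags, and the remaining lines are media URIs, often relative to the playlist's own URL. A minimal sketch of listing the segment URLs from such a playlist. The sample playlist and the `example.org` address are made up for illustration, and nested (master) playlists are not handled.

```python
from typing import List
from urllib.parse import urljoin

# Hypothetical playlist body; a real one would be fetched from the site.
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg-000.ts
#EXTINF:10.0,
seg-001.ts
#EXT-X-ENDLIST
"""


def segment_urls(playlist_text: str, playlist_url: str) -> List[str]:
    """Return absolute URLs for every media line in an m3u8 playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Skip blank lines and tag/comment lines, which begin with '#'.
        if not line or line.startswith("#"):
            continue
        # Relative URIs are resolved against the playlist's own URL.
        urls.append(urljoin(playlist_url, line))
    return urls


if __name__ == "__main__":
    for url in segment_urls(SAMPLE_PLAYLIST, "https://example.org/audio/episode.m3u8"):
        print(url)
```

The segments would still need to be downloaded and joined in order, which is the part a tool like youtube-dl takes care of.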
03:28
🔗
|
|
bwn_ has joined #archiveteam-bs |
03:29
🔗
|
|
Medowar has quit IRC (Quit: Connection closed for inactivity) |
03:34
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
03:39
🔗
|
|
bwn_ has quit IRC (Ping timeout: 633 seconds) |
03:47
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
03:52
🔗
|
|
bwn has joined #archiveteam-bs |
03:54
🔗
|
|
Mayonaise has quit IRC (Read error: Operation timed out) |
03:56
🔗
|
|
beardicus has quit IRC (Read error: Operation timed out) |
03:57
🔗
|
|
chazchaz has quit IRC (Read error: Operation timed out) |
03:57
🔗
|
|
chazchaz has joined #archiveteam-bs |
03:58
🔗
|
|
RichardG has quit IRC (Ping timeout: 272 seconds) |
04:00
🔗
|
|
RichardG has joined #archiveteam-bs |
04:00
🔗
|
|
Frogging has quit IRC (Read error: Operation timed out) |
04:03
🔗
|
|
achip has quit IRC (Ping timeout: 258 seconds) |
04:04
🔗
|
|
bwn has quit IRC (Ping timeout: 258 seconds) |
04:05
🔗
|
|
wyatt8740 has quit IRC (Read error: Operation timed out) |
04:05
🔗
|
|
godane has quit IRC (Ping timeout: 258 seconds) |
04:05
🔗
|
|
Kaz has quit IRC (Read error: Operation timed out) |
04:05
🔗
|
|
Infreq has quit IRC (Ping timeout: 258 seconds) |
04:05
🔗
|
|
logchfoo1 has quit IRC (Ping timeout: 258 seconds) |
04:10
🔗
|
|
logchfoo4 starts logging #archiveteam-bs at Fri Apr 29 04:10:22 2016 |
04:10
🔗
|
|
logchfoo4 has joined #archiveteam-bs |
04:10
🔗
|
|
fie_ has quit IRC (Read error: Connection reset by peer) |
04:10
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
04:11
🔗
|
|
balrog has joined #archiveteam-bs |
04:11
🔗
|
|
swebb sets mode: +o balrog |
04:11
🔗
|
|
ring has joined #archiveteam-bs |
04:12
🔗
|
|
achip has joined #archiveteam-bs |
04:13
🔗
|
|
mr-b has joined #archiveteam-bs |
04:13
🔗
|
|
zenguy has joined #archiveteam-bs |
04:15
🔗
|
|
acridAxid has joined #archiveteam-bs |
04:18
🔗
|
|
joepie91 has quit IRC (Read error: Operation timed out) |
04:20
🔗
|
|
joepie91 has joined #archiveteam-bs |
04:21
🔗
|
|
wyatt8740 has joined #archiveteam-bs |
04:22
🔗
|
|
godane has joined #archiveteam-bs |
04:22
🔗
|
|
Kaz has joined #archiveteam-bs |
04:24
🔗
|
|
dashcloud has joined #archiveteam-bs |
04:45
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
04:52
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:58
🔗
|
|
bwn has joined #archiveteam-bs |
05:08
🔗
|
|
bwn has quit IRC (Quit: Quit) |
05:20
🔗
|
|
beardicus has joined #archiveteam-bs |
05:23
🔗
|
|
Honno has joined #archiveteam-bs |
05:37
🔗
|
godane |
SketchCow: i'm grabbing 2600 off the wall wusb |
05:46
🔗
|
|
Mayonaise has joined #archiveteam-bs |
05:52
🔗
|
JesseW |
godane: That sentence could make much less sense if we didn't know the context. |
05:55
🔗
|
godane |
http://www.2600.com/otw-broadband.xml |
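Podcast feeds of this kind usually list each episode's audio file in an RSS `<enclosure>` element. A small sketch of pulling those URLs out of a feed, assuming a plain RSS 2.0 layout; the sample feed below is invented and does not reproduce the contents of the 2600 feed.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical feed body standing in for a downloaded podcast feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="http://example.org/shows/ep1.mp3" type="audio/mpeg" length="1"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="http://example.org/shows/ep2.mp3" type="audio/mpeg" length="1"/>
    </item>
  </channel>
</rss>
"""


def enclosure_urls(feed_xml: str) -> List[str]:
    """Collect the url attribute of every <enclosure> in an RSS document."""
    root = ET.fromstring(feed_xml)
    return [enc.get("url") for enc in root.iter("enclosure") if enc.get("url")]


if __name__ == "__main__":
    for url in enclosure_urls(SAMPLE_FEED):
        print(url)
```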
05:59
🔗
|
xmc |
godane: speaking of loveline -- 22:53 <supersat> last night of loveline tonight |
06:12
🔗
|
godane |
so i may be able to get the Bill O'Reilly radio program |
06:31
🔗
|
|
bwn has joined #archiveteam-bs |
06:49
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
06:57
🔗
|
godane |
we are also getting a brute force xml of billoreilly.com |
07:03
🔗
|
|
Honno has quit IRC (Read error: Operation timed out) |
07:23
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
07:59
🔗
|
|
bwn has joined #archiveteam-bs |
08:06
🔗
|
|
schbirid has joined #archiveteam-bs |
08:18
🔗
|
|
Medowar has joined #archiveteam-bs |
08:35
🔗
|
|
metalcamp has joined #archiveteam-bs |
10:06
🔗
|
|
SketchCo1 has joined #archiveteam-bs |
10:06
🔗
|
|
swebb sets mode: +o SketchCo1 |
10:07
🔗
|
|
SketchCow has quit IRC (Read error: Connection reset by peer) |
10:08
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
10:08
🔗
|
|
BnA-Rob1n has quit IRC (Ping timeout: 244 seconds) |
10:08
🔗
|
|
joepie91 has quit IRC (Ping timeout: 244 seconds) |
10:08
🔗
|
|
SN4T14 has quit IRC (Ping timeout: 244 seconds) |
10:08
🔗
|
|
zerkalo has quit IRC (Ping timeout: 244 seconds) |
10:09
🔗
|
|
BnA-Rob1n has joined #archiveteam-bs |
10:10
🔗
|
|
zerkalo has joined #archiveteam-bs |
10:12
🔗
|
|
joepie91 has joined #archiveteam-bs |
10:12
🔗
|
|
SN4T14 has joined #archiveteam-bs |
10:53
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
12:59
🔗
|
|
weslord has joined #archiveteam-bs |
13:19
🔗
|
godane |
i'm grabbing the old but official Classic Love Line mp3s from 1996 |
13:32
🔗
|
|
weslord has quit IRC (Quit: Lost terminal) |
13:43
🔗
|
|
VADemon has joined #archiveteam-bs |
14:08
🔗
|
ranma |
http://www.bloomberg.com/news/articles/2016-04-29/unmasking-the-men-behind-zero-hedge-wall-street-s-renegade-blog |
14:17
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
14:21
🔗
|
|
bwn_ has joined #archiveteam-bs |
14:34
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
15:01
🔗
|
|
Honno has joined #archiveteam-bs |
15:16
🔗
|
|
Start has joined #archiveteam-bs |
15:18
🔗
|
|
SketchCo1 is now known as SketchCow |
15:31
🔗
|
|
Yoshimura has joined #archiveteam-bs |
15:31
🔗
|
Yoshimura |
yipdw: Hey. Was there anything wrong with the pipeline? |
15:43
🔗
|
|
JesseW has joined #archiveteam-bs |
15:46
🔗
|
SketchCow |
Ha |
16:00
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
16:06
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
16:07
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
16:14
🔗
|
|
dashcloud has joined #archiveteam-bs |
16:47
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
16:54
🔗
|
|
dashcloud has joined #archiveteam-bs |
17:11
🔗
|
|
metalcamp has joined #archiveteam-bs |
17:27
🔗
|
godane |
SketchCow: The Savage Nation is getting uploaded: https://archive.org/details/godaneinbox?and[]=subject%3A%22The+Savage+Nation%22&sort=-publicdate |
17:32
🔗
|
|
yakfish has joined #archiveteam-bs |
17:32
🔗
|
|
matthusby has joined #archiveteam-bs |
17:34
🔗
|
|
SadDM has joined #archiveteam-bs |
17:34
🔗
|
|
swebb sets mode: +o SadDM |
17:37
🔗
|
|
jspiros has joined #archiveteam-bs |
17:41
🔗
|
godane |
SketchCow: just found out that Love Line ended their run last month |
17:41
🔗
|
SketchCow |
Yep |
17:42
🔗
|
SketchCow |
I figured that's what inspired you! |
17:42
🔗
|
godane |
i didn't even know |
17:43
🔗
|
godane |
the list is an incomplete one, but it's based on the official flashplayer date xml pages |
17:43
🔗
|
godane |
close to 3000 mp3s |
18:19
🔗
|
|
Start has joined #archiveteam-bs |
18:43
🔗
|
|
remsen has quit IRC (ircd.choopa.net irc2.choopa.net) |
18:43
🔗
|
|
remsen1 has joined #archiveteam-bs |
18:50
🔗
|
|
bwn_ has quit IRC (Read error: Operation timed out) |
19:10
🔗
|
|
bwn_ has joined #archiveteam-bs |
19:33
🔗
|
atrocity |
going to attempt to setup httrack, lol |
19:44
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
19:52
🔗
|
phuzion |
atrocity: https://www.youtube.com/watch?v=PEZWYXPvmS8 |
19:55
🔗
|
atrocity |
so i can point firefox to it to tell it to archive a site for me |
19:55
🔗
|
atrocity |
and it seems like it was the eternal debate of it vs. wget for windows, lol |
20:25
🔗
|
Yoshimura |
atrocity: httrack sucks, at least every time I tried it; I stopped trying in the past few years. |
20:25
🔗
|
Yoshimura |
Wget is consistent and works well. |
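For reference, a typical wget site-mirroring invocation can be assembled as below. This is only a sketch of one common flag combination, not a setup anyone in the channel described; the wrapper `wget_mirror_command` and the one-second default wait are assumptions.

```python
import shlex
from typing import List


def wget_mirror_command(url: str, wait_seconds: int = 1) -> List[str]:
    """Assemble an argument list for mirroring one site with wget."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch images, CSS and scripts
        "--convert-links",     # rewrite links for local browsing
        "--adjust-extension",  # add .html where the server omits it
        "--no-parent",         # stay below the starting directory
        f"--wait={wait_seconds}",  # pause between requests
        url,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so nothing is fetched here.
    print(shlex.join(wget_mirror_command("http://example.org/")))
```

Passing the list to `subprocess.run` would start the mirror; printing it first makes it easy to check the flags.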
20:28
🔗
|
|
remsen1 has quit IRC (ZNC 1.6.2 - http://znc.in) |
20:28
🔗
|
|
remsen has joined #archiveteam-bs |
20:32
🔗
|
atrocity |
hmm, kk |
20:48
🔗
|
|
tomwsmf-a has joined #archiveteam-bs |
20:53
🔗
|
VADemon |
Aaaand I've just run into the issue of Cloudflare not letting wget access anything backed by their CDN |
20:54
🔗
|
VADemon |
So much for the consistency. But does anyone have an easy solution for this maybe? |
20:54
🔗
|
r3c0d3x |
try and set a common user-agent and headers and try again |
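As a sketch of what r3c0d3x suggests, a request can be built with browser-like headers using Python's standard library. The header values below are illustrative assumptions rather than values anyone in the channel used. The wget counterparts are the `--user-agent` and `--header` options.

```python
import urllib.request

# Illustrative browser-like headers; any current browser's values would do.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url: str) -> urllib.request.Request:
    """Create a request object that carries the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


if __name__ == "__main__":
    req = build_request("http://example.org/")
    # Nothing is sent here; this only shows the headers that would go out.
    for name, value in req.header_items():
        print(f"{name}: {value}")
```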
20:56
🔗
|
VADemon |
I did that, but it's not enough. Wget needs to load a URL that is set in a "Refresh" HTTP header and wait before or after downloading that URL |
20:56
🔗
|
MrRadar |
I haven't had any trouble using wpull to scrape sites I know use CloudFlare as their CDN |
20:56
🔗
|
VADemon |
e.g. "Refresh: 8;URL=/cdn-cgi/l/chk_jschl?pass=1461960032.464-KSKrRC0DC7" |
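The header VADemon quotes has the form `<seconds>;URL=<target>`. A minimal sketch of splitting it into the delay and an absolute URL, which is the first step a custom script would need. It covers only the parsing described here and may well not be enough to pass the check on its own; `parse_refresh` and the `example.com` base URL are hypothetical.

```python
from typing import Tuple
from urllib.parse import urljoin


def parse_refresh(value: str, base_url: str) -> Tuple[float, str]:
    """Parse a Refresh header value into (delay_seconds, absolute_url)."""
    delay_part, _, rest = value.partition(";")
    delay = float(delay_part.strip())
    rest = rest.strip()
    # Without a URL part, a refresh simply reloads the current page.
    target = base_url
    if rest[:4].lower() == "url=":
        # Resolve a relative target against the page that sent the header.
        target = urljoin(base_url, rest[4:].strip().strip("'\""))
    return delay, target


if __name__ == "__main__":
    delay, target = parse_refresh(
        "8;URL=/cdn-cgi/l/chk_jschl?pass=1461960032.464-KSKrRC0DC7",
        "http://example.com/page",
    )
    print(delay, target)
```

A client would then sleep for `delay` seconds and request `target`, keeping whatever cookies the first response set.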
20:56
🔗
|
MrRadar |
Have you considered using wget-lua with a custom script to get through that check? |
20:58
🔗
|
VADemon |
Hm, that's a better idea than what I was going to pull off, MrRadar |
21:32
🔗
|
|
tomwsmf-a has quit IRC (Read error: Operation timed out) |
21:42
🔗
|
VADemon |
wpull doesn't work, because it doesn't do what a browser would do |
21:44
🔗
|
MrRadar |
The sites I've scraped must not have had their bot-protection turned up |
22:25
🔗
|
|
Start has joined #archiveteam-bs |
22:57
🔗
|
|
Honno has quit IRC (Read error: Operation timed out) |
23:29
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
23:44
🔗
|
|
Stiletto has quit IRC () |