[00:03] *** dashcloud has quit IRC (Read error: Operation timed out)
[00:06] *** dashcloud has joined #archiveteam-bs
[00:31] *** _Crocatow has quit IRC (Read error: Connection reset by peer)
[00:31] *** _Crocatow has joined #archiveteam-bs
[00:41] *** JesseW has joined #archiveteam-bs
[00:56] *** fmope has quit IRC (Remote host closed the connection)
[00:56] *** fmope has joined #archiveteam-bs
[00:59] *** Stiletto has quit IRC (Read error: Operation timed out)
[02:12] *** ranma has joined #archiveteam-bs
[02:15] the IA makes a LITTLE bit of an effort to backup zips and exes etc on websites, right?
[02:15] or not usually
[02:15] ?
[02:17] I don't think IA's general crawls make a distinction between such files and any other ones.
[02:18] What's the URL of the fork on github?
[02:18] https://github.com/OldSparkyMI/minishowcase
[02:18] *** Stiletto has joined #archiveteam-bs
[02:19] *** VADemon has quit IRC (Quit: left4dead)
[02:19] ranma: probably worth grabbing a copy of that yourself, if you haven't already
[02:19] yep. and threw it on google drive
[02:20] I've just put the zip, https://github.com/OldSparkyMI/minishowcase/archive/master.zip into #archivebot, too
[02:20] so the thing is pretty well saved at this point
[02:22] i'm assuming thisamericanlife, radiolab, other npr podcasts are backed up?
[02:24] ranma: check with godane
[02:24] what shows up on Wayback?
[02:25] checking
[02:26] i have this american life on my drive
[02:26] i have not uploaded it yet
[02:27] https://web.archive.org/web/20160127230614/http://www.thisamericanlife.org/radio-archives/episode/549/amateur-hour
[02:27] download link doesn't work
[02:28] not butthurt, just more curious as to the archive bot-thingie's capabilities
[02:34] Huh, it looks like they used to have direct download links but now redirect you to buy episodes from iTunes and Amazon
[02:34] E.g. the download link for this episode works (though it was also archived through ArchiveBot and not the IA's normal crawler) https://web.archive.org/web/20140929042520/http://www.thisamericanlife.org/radio-archives/episode/536/the-secret-recordings-of-carmen-segarra
[02:36] Hmm... it looks like episodes are only available for direct download for a short time before the downloads are paywalled
[02:37] E.g. last week's episode has a direct download http://www.thisamericanlife.org/radio-archives/episode/585/in-defense-of-ignorance
[02:37] But this one from January is pay to download http://www.thisamericanlife.org/radio-archives/episode/577/something-only-i-can-see
[02:38] If it doesn't get crawled while it's a direct download, neither our bot nor the IA's can probably extract the audio from the stream
[02:38] there is an m3u8 file that you can use
[02:39] It looks like youtube-dl can also scrape their streams
[02:39] (This American Life at least)
[03:28] *** bwn_ has joined #archiveteam-bs
[03:29] *** Medowar has quit IRC (Quit: Connection closed for inactivity)
[03:34] *** bwn has quit IRC (Read error: Operation timed out)
[03:39] *** bwn_ has quit IRC (Ping timeout: 633 seconds)
[03:47] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[03:52] *** bwn has joined #archiveteam-bs
[03:54] *** Mayonaise has quit IRC (Read error: Operation timed out)
[03:56] *** beardicus has quit IRC (Read error: Operation timed out)
[03:57] *** chazchaz has quit IRC (Read error: Operation timed out)
[03:57] *** chazchaz has joined #archiveteam-bs
[03:58] *** RichardG has quit IRC (Ping timeout: 272 seconds)
[04:00] *** RichardG has joined #archiveteam-bs
[04:00] *** Frogging has quit IRC (Read error: Operation timed out)
[04:03] *** achip has quit IRC (Ping timeout: 258 seconds)
[04:04] *** bwn has quit IRC (Ping timeout: 258 seconds)
[04:05] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[04:05] *** godane has quit IRC (Ping timeout: 258 seconds)
[04:05] *** Kaz has quit IRC (Read error: Operation timed out)
[04:05] *** Infreq has quit IRC (Ping timeout: 258 seconds)
[04:05] *** logchfoo1 has quit IRC (Ping timeout: 258 seconds)
[04:10] *** logchfoo4 starts logging #archiveteam-bs at Fri Apr 29 04:10:22 2016
[04:10] *** logchfoo4 has joined #archiveteam-bs
[04:10] *** fie_ has quit IRC (Read error: Connection reset by peer)
[04:10] *** dashcloud has quit IRC (Read error: Operation timed out)
[04:11] *** balrog has joined #archiveteam-bs
[04:11] *** swebb sets mode: +o balrog
[04:11] *** ring has joined #archiveteam-bs
[04:12] *** achip has joined #archiveteam-bs
[04:13] *** mr-b has joined #archiveteam-bs
[04:13] *** zenguy has joined #archiveteam-bs
[04:15] *** acridAxid has joined #archiveteam-bs
[04:18] *** joepie91 has quit IRC (Read error: Operation timed out)
[04:20] *** joepie91 has joined #archiveteam-bs
[04:21] *** wyatt8740 has joined #archiveteam-bs
[04:22] *** godane has joined #archiveteam-bs
[04:22] *** Kaz has joined #archiveteam-bs
[04:24] *** dashcloud has joined #archiveteam-bs
[04:45] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:52] *** Sk1d has joined #archiveteam-bs
[04:58] *** bwn has joined #archiveteam-bs
[05:08] *** bwn has quit IRC (Quit: Quit)
[05:20] *** beardicus has joined #archiveteam-bs
[05:23] *** Honno has joined #archiveteam-bs
[05:37] SketchCow: i'm grabbing 2600 off the wall wusb
[05:46] *** Mayonaise has joined #archiveteam-bs
[05:52] godane: That sentence could make much less sense if we didn't know the context.
[05:55] http://www.2600.com/otw-broadband.xml
[05:59] godane: speaking of loveline -- 22:53 last night of loveline tonight
[06:12] so i may be able to get the Bill O'Reilly radio program
[06:31] *** bwn has joined #archiveteam-bs
[06:49] *** bwn has quit IRC (Read error: Operation timed out)
[06:57] we are also getting a brute force xml of billoreilly.com
[07:03] *** Honno has quit IRC (Read error: Operation timed out)
[07:23] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[07:59] *** bwn has joined #archiveteam-bs
[08:06] *** schbirid has joined #archiveteam-bs
[08:18] *** Medowar has joined #archiveteam-bs
[08:35] *** metalcamp has joined #archiveteam-bs
[10:06] *** SketchCo1 has joined #archiveteam-bs
[10:06] *** swebb sets mode: +o SketchCo1
[10:07] *** SketchCow has quit IRC (Read error: Connection reset by peer)
[10:08] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[10:08] *** BnA-Rob1n has quit IRC (Ping timeout: 244 seconds)
[10:08] *** joepie91 has quit IRC (Ping timeout: 244 seconds)
[10:08] *** SN4T14 has quit IRC (Ping timeout: 244 seconds)
[10:08] *** zerkalo has quit IRC (Ping timeout: 244 seconds)
[10:09] *** BnA-Rob1n has joined #archiveteam-bs
[10:10] *** zerkalo has joined #archiveteam-bs
[10:12] *** joepie91 has joined #archiveteam-bs
[10:12] *** SN4T14 has joined #archiveteam-bs
[10:53] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:59] *** weslord has joined #archiveteam-bs
[13:19] i'm grabbing the old but official Classic Love Line mp3s from 1996
[13:32] *** weslord has quit IRC (Quit: Lost terminal)
[13:43] *** VADemon has joined #archiveteam-bs
[14:08] http://www.bloomberg.com/news/articles/2016-04-29/unmasking-the-men-behind-zero-hedge-wall-street-s-renegade-blog
[14:17] *** Start has quit IRC (Quit: Disconnected.)
[14:21] *** bwn_ has joined #archiveteam-bs
[14:34] *** bwn has quit IRC (Read error: Operation timed out)
[15:01] *** Honno has joined #archiveteam-bs
[15:16] *** Start has joined #archiveteam-bs
[15:18] *** SketchCo1 is now known as SketchCow
[15:31] *** Yoshimura has joined #archiveteam-bs
[15:31] yipdw: Hey. Was there anything wrong with the pipeline?
[15:43] *** JesseW has joined #archiveteam-bs
[15:46] Ha
[16:00] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[16:06] *** Start has quit IRC (Quit: Disconnected.)
[16:07] *** dashcloud has quit IRC (Read error: Operation timed out)
[16:14] *** dashcloud has joined #archiveteam-bs
[16:47] *** dashcloud has quit IRC (Read error: Operation timed out)
[16:54] *** dashcloud has joined #archiveteam-bs
[17:11] *** metalcamp has joined #archiveteam-bs
[17:27] SketchCow: The Savage Nation is getting uploaded: https://archive.org/details/godaneinbox?and[]=subject%3A%22The+Savage+Nation%22&sort=-publicdate
[17:32] *** yakfish has joined #archiveteam-bs
[17:32] *** matthusby has joined #archiveteam-bs
[17:34] *** SadDM has joined #archiveteam-bs
[17:34] *** swebb sets mode: +o SadDM
[17:37] *** jspiros has joined #archiveteam-bs
[17:41] SketchCow: just found out that Love Line ended their run last month
[17:41] Yep
[17:42] I figured that's what inspired you!
[17:42] i didn't even know
[17:43] the list is an incomplete one but it's based on the official flashplayer date xml pages
[17:43] close to 3000 mp3s
[18:19] *** Start has joined #archiveteam-bs
[18:43] *** remsen has quit IRC (ircd.choopa.net irc2.choopa.net)
[18:43] *** remsen1 has joined #archiveteam-bs
[18:50] *** bwn_ has quit IRC (Read error: Operation timed out)
[19:10] *** bwn_ has joined #archiveteam-bs
[19:33] going to attempt to setup httrack, lol
[19:44] *** Start has quit IRC (Quit: Disconnected.)
[19:52] atrocity: https://www.youtube.com/watch?v=PEZWYXPvmS8
[19:55] so i can point firefox to it to tell it to archive a site for me
[19:55] and it seems like it was the eternal debate of it vs. wget for windows, lol
[20:25] atrocity: httrack sucks, at least every time I tried, stopped trying past years.
[20:25] Wget is consistent and works well.
[20:28] *** remsen1 has quit IRC (ZNC 1.6.2 - http://znc.in)
[20:28] *** remsen has joined #archiveteam-bs
[20:32] hmm, kk
[20:48] *** tomwsmf-a has joined #archiveteam-bs
[20:53] Aaaand I've just run into the issue of Cloudflare not letting wget access anything backed by their CDN
[20:54] So much for the consistency. But does anyone have an easy solution for this maybe?
[20:54] try and set a common user-agent and headers and try again
[20:56] I did that, but it's not enough. Wget needs to load a URL that is set as a "Refresh" HTTP header and wait before or after downloading the set URL
[20:56] I haven't had any trouble using wpull to scrape sites I know use CloudFlare as their CDN
[20:56] e.g. "Refresh: 8;URL=/cdn-cgi/l/chk_jschl?pass=1461960032.464-KSKrRC0DC7"
[20:56] Have you considered using wget-lua with a custom script to get through that check?
[20:58] Hm, that's a better idea than what I was going to pull off, MrRadar
[21:32] *** tomwsmf-a has quit IRC (Read error: Operation timed out)
[21:42] wpull doesn't work, because it doesn't do what a browser would do
[21:44] The sites I've scraped must not have had their bot-protection turned up
[22:25] *** Start has joined #archiveteam-bs
[22:57] *** Honno has quit IRC (Read error: Operation timed out)
[23:29] *** schbirid has quit IRC (Remote host closed the connection)
[23:44] *** Stiletto has quit IRC ()
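Note on the "Refresh" header quoted at 20:56: a minimal sketch, in Python, of how a value like "8;URL=/cdn-cgi/l/chk_jschl?pass=..." could be split into a wait time and an absolute URL to fetch next. The function name parse_refresh and the example.com base URL are made up for illustration and are not part of wget, wpull, or wget-lua. This covers only reading the header; it does not solve any JavaScript challenge the CDN may also require.

```python
from urllib.parse import urljoin


def parse_refresh(header_value, base_url):
    """Split a Refresh header value into (delay_seconds, absolute_url).

    parse_refresh is a hypothetical helper, not an API of any crawler.
    """
    delay_part, _, rest = header_value.partition(";")
    delay = float(delay_part.strip())
    rest = rest.strip()
    # The target is conventionally written as "URL=<target>", key case-insensitive.
    if rest[:4].lower() == "url=":
        rest = rest[4:].strip()
    # With no target, the header means "reload the current page".
    target = urljoin(base_url, rest) if rest else base_url
    return delay, target


if __name__ == "__main__":
    # Header value taken from the log; the base URL is an assumed placeholder.
    header = "8;URL=/cdn-cgi/l/chk_jschl?pass=1461960032.464-KSKrRC0DC7"
    delay, target = parse_refresh(header, "http://example.com/some/page")
    print(delay, target)
```

A crawler script would then sleep for the returned delay and request the returned URL with the same cookies and user-agent before retrying the original page.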