[00:04] *** Ravenloft has joined #archiveteam-bs [00:04] https://youtu.be/3Or4nWQpidk?t=600 [00:17] i'm at 963k items now [00:31] *** jrwr has quit IRC (Remote host closed the connection) [00:33] *** jrwr has joined #archiveteam-bs [00:49] *** powerKitt has joined #archiveteam-bs [00:59] *** powerKitt has quit IRC (Quit: Page closed) [01:45] *** kristian_ has quit IRC (Quit: Leaving) [02:01] *** ravetcofx has quit IRC (Ping timeout: 506 seconds) [02:10] *** ravetcofx has joined #archiveteam-bs [02:25] *** jrwr has quit IRC (Remote host closed the connection) [02:52] *** ndiddy has quit IRC (Quit: Leaving) [02:53] *** BlueMaxim has quit IRC (Read error: Operation timed out) [02:54] *** BlueMaxim has joined #archiveteam-bs [03:37] *** Ravenloft has quit IRC (Ping timeout: 633 seconds) [03:54] *** GLaDOS has quit IRC (Quit: Oh crap, I died.) [05:12] *** balrog has quit IRC (Ping timeout: 260 seconds) [05:21] *** Sk1d has quit IRC (Ping timeout: 194 seconds) [05:27] *** Sk1d has joined #archiveteam-bs [05:52] *** Start has quit IRC (Quit: Disconnected.) [05:55] *** Start has joined #archiveteam-bs [06:15] *** balrog has joined #archiveteam-bs [06:15] *** swebb sets mode: +o balrog [08:00] *** Observer has quit IRC (Ping timeout: 268 seconds) [08:12] *** GE has joined #archiveteam-bs [10:05] *** ravetcofx has quit IRC (Ping timeout: 506 seconds) [10:19] *** BlueMaxim has quit IRC (Quit: Leaving) [10:36] damn godane [10:37] when do you think you will hit 1m? 
[10:43] *** antomati_ has joined #archiveteam-bs [10:43] *** swebb sets mode: +o antomati_ [10:49] *** antomatic has quit IRC (Read error: Operation timed out) [11:48] *** GE has quit IRC (Remote host closed the connection) [12:25] *** Budgiebra has joined #archiveteam-bs [12:25] *** Budgiebra has left [13:27] *** GE has joined #archiveteam-bs [13:51] *** sep332 has joined #archiveteam-bs [14:25] *** ndiddy has joined #archiveteam-bs [14:26] *** ndizzle has joined #archiveteam-bs [14:26] *** ndizzle has quit IRC (Read error: Connection reset by peer) [15:41] *** powerKitt has joined #archiveteam-bs [15:42] It looks like the Wayback machine will outright refuse to import a WARC at all if it contains pages set as off-limits in robots.txt [15:44] Example: https://archive.org/details/archiveteam_archivebot_go_falconk_sbarg_boards_net_20161015 https://web-beta.archive.org/web/*/sbarg.boards.net/* [15:45] Because the archivebot capture has pages set as off-limits in robots.txt, all the wayback machine imported from that full site warc was robots.txt [15:51] not sure what's going on there [15:51] the warc has the url http://sbarg.boards.net/board/1/general-announcements - that isn't blocked by robots.txt but isn't in the wayback [15:53] Yeah, it's kind of bizarre huh? [15:53] it's not in a web collection [15:53] so it just takes longer to get into the wayback [15:53] that's from my experience [15:53] ah, I assumed it only needed to be mediatype web [15:53] that experience is some months old though [15:53] Oh, okay. [15:53] yeah, afaik it only needs web, if it isn't in a web collection it just takes longer to import [15:54] (that might have changed though) [15:54] but check back in a few weeks or so [15:54] I wish I could just add items to the web collection. [15:55] what items do you have? [15:56] https://archive.org/details/powerKitten-WARC here's one of them, I'm not sure if it's set up properly though.
[15:57] My other items with WARCs in them are https://archive.org/details/SBARG-WARC-Scrapes , https://archive.org/details/Tumblr-WARC-SBARG and https://archive.org/details/sbarg [16:01] ( I know the two neopets pages I archived won't ever show up in the wayback, since www.neopets.com is just generally excluded from it. ) [16:07] So, I need to make sure my WARC items have their mediatype set to "web" right arkiver? [16:08] *** RichardG has joined #archiveteam-bs [16:22] Anyway, I was also going to ask if [16:23] 1) there was a script to crawl a MediaWiki (or MediaWiki dump) for outgoing links and compile them into a list. [16:24] ie: a txt file [16:28] powerKitt: check http://archiveteam.org/index.php?title=Site_exploration#MediaWiki_wikis [16:28] it may not work for your case [16:28] but it also may in which case it's perfect [16:28] *** powerKitt has quit IRC (Ping timeout: 268 seconds) [17:19] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [17:19] *** BartoCH has joined #archiveteam-bs [17:20] *** kristian_ has joined #archiveteam-bs [17:27] *** JW_work has joined #archiveteam-bs [17:39] *** powerKitt has joined #archiveteam-bs [18:00] *** Aranje has joined #archiveteam-bs [18:09] *** bwn has quit IRC (Read error: Operation timed out) [18:30] *** bwn has joined #archiveteam-bs [18:40] *** bwn has quit IRC (Ping timeout: 244 seconds) [18:44] Is there a good strategy for archiving a flash-based website? Like, http://pokemondiamondandpearl.nintendo-europe.com/ for example. [18:46] Because all I can really think of is going through and finding all the swfs that make the site, and reverse engineering them for asset links, and then feeding that massive pile of links to Wget with WARC enabled. [18:47] *** bwn has joined #archiveteam-bs [18:47] powerKitt: if it needs no external resources just get the swf.
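Editor's sketch for the 16:23 question about compiling a MediaWiki's outgoing links into a txt file. This is not the script from the Site_exploration wiki page; it is a minimal illustration that assumes a standard MediaWiki XML export is already on disk. The function names, the URL regex and the trailing-punctuation rule are assumptions made for the example, and the regex is a heuristic that can miss or over-match some links.

```python
import re
import xml.etree.ElementTree as ET

# Heuristic: an http(s) URL runs until whitespace or a wikitext delimiter.
URL_RE = re.compile(r"https?://[^\s\]\[<>\"|}]+")


def extract_external_links(wikitext):
    """Return the http(s) URLs in a chunk of wikitext, first-seen order, no repeats."""
    seen = set()
    links = []
    for url in URL_RE.findall(wikitext):
        # Drop sentence punctuation or quote marks that got glued onto the URL.
        url = url.rstrip(".,;:!?'")
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


def links_from_dump(dump_path):
    """Yield URLs found in every <text> element of a MediaWiki XML export."""
    for _event, elem in ET.iterparse(dump_path):
        # Export files are namespaced, so compare only the local tag name.
        if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
            yield from extract_external_links(elem.text)
        elem.clear()  # free memory held by elements that are already processed


def write_link_list(dump_path, out_path):
    """Write one unique URL per line to out_path and return how many were written."""
    unique = sorted(set(links_from_dump(dump_path)))
    with open(out_path, "w", encoding="utf-8") as out:
        out.writelines(url + "\n" for url in unique)
    return len(unique)
```

Calling `write_link_list("dump.xml", "links.txt")` (both file names hypothetical) would produce the kind of plain list that can later be fed to a crawler. Links back to the wiki's own host are not filtered out here.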
[19:05] *** ravetcofx has joined #archiveteam-bs [19:06] *** bsmith093 has quit IRC (Read error: Operation timed out) [19:48] *** bsmith093 has joined #archiveteam-bs [20:53] I see that this new batch of MSDN CDs is 142gb of material [20:53] Damn son [20:58] http://us.xploder.net/nintendo-ds/ Ugh, archiving this site will be hell. It appears like they've copied over the contents of a now dead http://updates.xploder.net/ but haven't changed any of the links, which are all hardcoded to point at things on http://updates.xploder.net/ [20:59] oh fun [21:00] The best idea I have is to try and mirror it to a WARC with Wget, and use some OS level thing to make http://updates.xploder.net/ point to the ip of http://us.xploder.net/ [21:00] https://en.wikipedia.org/wiki/Hosts_(file) ie, this [21:00] yea that [21:01] I'm worried that'll fuck up the WARC generation, though. [21:02] it is interfering with the natural order of things [21:02] but if it's necessary, [21:05] Once I get the warc, what's the proper way to upload it to http://www.archive.org/ again? [21:05] make an item in community texts, upload the warc into the item, set mediatype web if you can [21:05] then get someone's attention here so it can be moved and blessed correctly, things that you can't do without the right admin privileges [21:06] http://archive.org/upload/ [21:06] Alright. [21:12] *** Start has quit IRC (Read error: Connection reset by peer) [21:13] *** Start has joined #archiveteam-bs [21:56] *** BlueMaxim has joined #archiveteam-bs [22:29] *** RichardG_ has joined #archiveteam-bs [22:29] *** RichardG has quit IRC (Ping timeout: 370 seconds) [22:54] *** powerKitt has quit IRC (Ping timeout: 268 seconds) [23:16] wat [23:16] CenturyLink to Buy Level 3 for $34B http://hn.premii.com/#/article/12835826 [23:16] (hackernews) [23:17] ick… [23:18] yeah :x [23:19] hm [23:20] ... just yet another big consolidation, predicted many many years ago, the trend, that is.
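Editor's sketch for the 21:00 to 21:05 exchange (hosts-file override, Wget mirror to a WARC, upload with mediatype web). It only builds the hosts line and the command lines so they can be reviewed before anything is run. The IP address, the WARC name and the item identifier are placeholders invented for the example, and the choice of extra Wget flags (`--page-requisites`, `--span-hosts`, `--domains`) is an assumption about what such a mirror might need, not something stated in the channel. The upload command assumes the Internet Archive `ia` command-line tool is installed and configured.

```python
import shlex

# Placeholder from the documentation address range (RFC 5737); the real IP of
# us.xploder.net would have to be looked up before editing the hosts file.
PLACEHOLDER_IP = "203.0.113.10"


def hosts_entry(ip, hostname):
    """Format one hosts-file line that maps hostname to ip."""
    return f"{ip}\t{hostname}"


def wget_warc_command(warc_name, start_url=None, url_list=None, domains=None):
    """Build a Wget argument list that records everything it fetches into a WARC."""
    cmd = ["wget", f"--warc-file={warc_name}", "--page-requisites"]
    if domains:
        # Let the crawl follow links onto the listed hosts only.
        cmd += ["--span-hosts", f"--domains={','.join(domains)}"]
    if url_list:
        # A plain text file with one URL per line, e.g. a list of asset links.
        cmd.append(f"--input-file={url_list}")
    if start_url:
        cmd += ["--mirror", start_url]
    return cmd


def ia_upload_command(identifier, warc_path):
    """Build an `ia upload` argument list that sets mediatype to web."""
    return ["ia", "upload", identifier, warc_path, "--metadata=mediatype:web"]


if __name__ == "__main__":
    print(hosts_entry(PLACEHOLDER_IP, "updates.xploder.net"))
    print(shlex.join(wget_warc_command(
        "xploder",
        start_url="http://us.xploder.net/nintendo-ds/",
        domains=["us.xploder.net", "updates.xploder.net"],
    )))
    # "my-xploder-warc" is a hypothetical item identifier.
    print(shlex.join(ia_upload_command("my-xploder-warc", "xploder.warc.gz")))
```

The same `wget_warc_command` helper covers the earlier flash-site idea of feeding a pile of asset links to Wget: pass `url_list` and leave `start_url` unset. Moving the item into a web collection still needs an admin, as noted at 21:05.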
[23:27] *** RichardG_ has quit IRC (Read error: Connection reset by peer) [23:27] *** RichardG has joined #archiveteam-bs [23:48] i'm uploading www.abc.net.au/news/2013 urls