#archiveteam-bs 2016-10-31,Mon


Time Nickname Message
00:04 Ravenloft has joined #archiveteam-bs
00:04 Ravenloft https://youtu.be/3Or4nWQpidk?t=600
00:17 godane i'm at 963k items now
00:31 jrwr has quit IRC (Remote host closed the connection)
00:33 jrwr has joined #archiveteam-bs
00:49 powerKitt has joined #archiveteam-bs
00:59 powerKitt has quit IRC (Quit: Page closed)
01:45 kristian_ has quit IRC (Quit: Leaving)
02:01 ravetcofx has quit IRC (Ping timeout: 506 seconds)
02:10 ravetcofx has joined #archiveteam-bs
02:25 jrwr has quit IRC (Remote host closed the connection)
02:52 ndiddy has quit IRC (Quit: Leaving)
02:53 BlueMaxim has quit IRC (Read error: Operation timed out)
02:54 BlueMaxim has joined #archiveteam-bs
03:37 Ravenloft has quit IRC (Ping timeout: 633 seconds)
03:54 GLaDOS has quit IRC (Quit: Oh crap, I died.)
05:12 balrog has quit IRC (Ping timeout: 260 seconds)
05:21 Sk1d has quit IRC (Ping timeout: 194 seconds)
05:27 Sk1d has joined #archiveteam-bs
05:52 Start has quit IRC (Quit: Disconnected.)
05:55 Start has joined #archiveteam-bs
06:15 balrog has joined #archiveteam-bs
06:15 swebb sets mode: +o balrog
08:00 Observer has quit IRC (Ping timeout: 268 seconds)
08:12 GE has joined #archiveteam-bs
10:05 ravetcofx has quit IRC (Ping timeout: 506 seconds)
10:19 BlueMaxim has quit IRC (Quit: Leaving)
10:36 Midas damn godane
10:37 Midas when do you think you will hit 1m?
10:43 antomati_ has joined #archiveteam-bs
10:43 swebb sets mode: +o antomati_
10:49 antomatic has quit IRC (Read error: Operation timed out)
11:48 GE has quit IRC (Remote host closed the connection)
12:25 Budgiebra has joined #archiveteam-bs
12:25 Budgiebra has left
13:27 GE has joined #archiveteam-bs
13:51 sep332 has joined #archiveteam-bs
14:25 ndiddy has joined #archiveteam-bs
14:26 ndizzle has joined #archiveteam-bs
14:26 ndizzle has quit IRC (Read error: Connection reset by peer)
15:41 powerKitt has joined #archiveteam-bs
15:42 powerKitt It looks like the Wayback Machine will outright refuse to import a WARC at all if it contains pages set as off-limits in robots.txt
15:44 powerKitt Example: https://archive.org/details/archiveteam_archivebot_go_falconk_sbarg_boards_net_20161015 https://web-beta.archive.org/web/*/sbarg.boards.net/*
15:45 powerKitt Because the archivebot capture has pages set as off-limits in robots.txt, all the Wayback Machine imported from that full-site WARC was robots.txt
15:51 Kaz not sure what's going on there
15:51 Kaz the warc has the url http://sbarg.boards.net/board/1/general-announcements - that isn't blocked by robots.txt but isn't in the wayback
15:53 powerKitt Yeah, it's kind of bizarre huh?
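Kaz's check, whether a given captured URL is actually disallowed by robots.txt, can be sketched with Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS` are hypothetical and only stand in for whatever sbarg.boards.net really serves, and `is_allowed` is an illustrative helper name, not part of any archive.org tooling.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only; not the site's real robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /user/
Disallow: /search
"""

if __name__ == "__main__":
    for url in (
        "http://sbarg.boards.net/board/1/general-announcements",
        "http://sbarg.boards.net/user/1",
    ):
        print(url, "allowed" if is_allowed(SAMPLE_ROBOTS, url) else "blocked")
```

This only says what the rules permit; it does not explain why an allowed URL is still missing from the Wayback Machine.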
15:53 arkiver it's not in a web collection
15:53 arkiver so it just takes longer to get into the wayback
15:53 arkiver that's from my experience
15:53 Kaz ah, I assumed it only needed to be mediatype web
15:53 arkiver that experience is some months old though
15:53 powerKitt Oh, okay.
15:53 arkiver yeah, afaik it only needs web, if it isn't in a web collection it just takes longer to import
15:54 arkiver (that might have changed though)
15:54 arkiver but check back in a few weeks or so
15:54 powerKitt I wish I could just add items to the web collection.
15:55 arkiver what items do you have?
15:56 powerKitt https://archive.org/details/powerKitten-WARC here's one of them, I'm not sure if it's set up properly though.
15:57 powerKitt My other items with WARCs in them are https://archive.org/details/SBARG-WARC-Scrapes , https://archive.org/details/Tumblr-WARC-SBARG and https://archive.org/details/sbarg
16:01 powerKitt ( I know the two neopets pages I archived won't ever show up in the wayback, since www.neopets.com is just generally excluded from it. )
16:07 powerKitt So, I need to make sure my WARC items have their mediatype set to "web", right arkiver?
16:08 RichardG has joined #archiveteam-bs
16:22 powerKitt Anyway, I was also going to ask if
16:23 powerKitt 1) there was a script to crawl a MediaWiki (or MediaWiki dump) for outgoing links and compile them into a list.
16:24 powerKitt ie: a txt file
16:28 Sanqui powerKitt: check http://archiveteam.org/index.php?title=Site_exploration#MediaWiki_wikis
16:28 Sanqui it may not work for your case
16:28 Sanqui but it also may in which case it's perfect
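For the dump case powerKitt describes, a minimal sketch in Python: read a MediaWiki XML export, pull anything that looks like an http(s) URL out of each page's wikitext, and return a sorted list that can be written to a txt file. This is not the script from the wiki page Sanqui links; the function name, regex, and sample dump are assumptions for illustration, and the regex will miss links generated by templates.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Rough pattern for URLs in wikitext; stops at whitespace and wiki/HTML delimiters.
URL_RE = re.compile(r"""https?://[^\s\[\]<>"|{}]+""")


def external_links_from_dump(xml_text: str) -> List[str]:
    """Collect unique http(s) URLs found in the <text> elements of a dump."""
    root = ET.fromstring(xml_text)
    found = set()
    for elem in root.iter():
        # Export tags are namespaced, e.g. "{http://www.mediawiki.org/...}text".
        if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
            for url in URL_RE.findall(elem.text):
                # Trim trailing punctuation; this can clip URLs that really end in ")".
                found.add(url.rstrip(".,;:!?)'"))
    return sorted(found)


# Tiny made-up dump for demonstration.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>See [http://example.com/a Example A] and https://example.org/b.</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    links = external_links_from_dump(SAMPLE_DUMP)
    with open("links.txt", "w", encoding="utf-8") as handle:
        handle.write("\n".join(links) + "\n")
```

For a large dump, streaming with `ET.iterparse` would avoid loading the whole file into memory.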
16:28 powerKitt has quit IRC (Ping timeout: 268 seconds)
17:19 BartoCH has quit IRC (Ping timeout: 260 seconds)
17:19 BartoCH has joined #archiveteam-bs
17:20 kristian_ has joined #archiveteam-bs
17:27 JW_work has joined #archiveteam-bs
17:39 powerKitt has joined #archiveteam-bs
18:00 Aranje has joined #archiveteam-bs
18:09 bwn has quit IRC (Read error: Operation timed out)
18:30 bwn has joined #archiveteam-bs
18:40 bwn has quit IRC (Ping timeout: 244 seconds)
18:44 powerKitt Is there a good strategy for archiving a flash-based website? Like, http://pokemondiamondandpearl.nintendo-europe.com/ for example.
18:46 powerKitt Because all I can really think of is going through and finding all the swfs that make the site, and reverse engineering them for asset links, and then feeding that massive pile of links to Wget with WARC enabled.
18:47 bwn has joined #archiveteam-bs
18:47 Yoshimura powerKitt: if it needs no external resources just get the swf.
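A crude first pass at "reverse engineering them for asset links" is to scan the SWF body for strings that look like asset paths. The sketch below assumes only the SWF header layout (3-byte signature, 1-byte version, 4-byte length, with `CWS` files zlib-compressed after those 8 bytes); the extension list and function names are illustrative. Paths that ActionScript assembles at runtime will not show up, and LZMA-compressed `ZWS` files are not handled.

```python
import re
import zlib
from typing import List

# Byte strings ending in a few common asset extensions (illustrative list).
ASSET_RE = re.compile(rb"[\w./:-]+\.(?:swf|xml|jpg|png|gif|mp3|flv)", re.IGNORECASE)


def swf_body(data: bytes) -> bytes:
    """Return the SWF payload after the 8-byte header, decompressed if needed."""
    signature = data[:3]
    if signature == b"FWS":  # uncompressed
        return data[8:]
    if signature == b"CWS":  # zlib-compressed after the header
        return zlib.decompress(data[8:])
    raise ValueError("unsupported or non-SWF signature: %r" % signature)


def candidate_assets(data: bytes) -> List[str]:
    """List unique path-like strings found in the SWF payload."""
    matches = ASSET_RE.findall(swf_body(data))
    return sorted({m.decode("ascii", "replace") for m in matches})


if __name__ == "__main__":
    # Synthetic stand-in for a real file; replace with open(path, "rb").read().
    payload = b"\x00junk\x00assets/intro.swf\x00data/config.xml\x00"
    fake_swf = b"CWS" + bytes([9]) + (0).to_bytes(4, "little") + zlib.compress(payload)
    for path in candidate_assets(fake_swf):
        print(path)
```

The resulting paths would still need to be resolved against the site's base URL before being handed to Wget.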
19:05 ravetcofx has joined #archiveteam-bs
19:06 bsmith093 has quit IRC (Read error: Operation timed out)
19:48 bsmith093 has joined #archiveteam-bs
20:53 SketchCow I see that this new batch of MSDN CDs is 142gb of material
20:53 SketchCow Damn son
20:58 powerKitt http://us.xploder.net/nintendo-ds/ Ugh, archiving this site will be hell. It appears like they've copied over the contents of a now dead http://updates.xploder.net/ but haven't changed any of the links, which are all hardcoded to point at things on http://updates.xploder.net/
20:59 xmc oh fun
21:00 powerKitt The best idea I have is to try and mirror it to a WARC with Wget, and use some OS level thing to make http://updates.xploder.net/ point to the ip of http://us.xploder.net/
21:00 powerKitt https://en.wikipedia.org/wiki/Hosts_(file) ie, this
21:00 xmc yea that
21:01 powerKitt I'm worried that'll fuck up the WARC generation, though.
21:02 xmc it is interfering with the natural order of things
21:02 xmc but if it's necessary,
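The hosts-file idea can be sketched as two pieces: the line to add to the hosts file and the Wget invocation. The IP address below is a documentation placeholder, not the real address of us.xploder.net, and the helper names are hypothetical. Whether the server answers correctly for the `updates.xploder.net` host name is untested here, and the resulting WARC would record the responses under the `updates.xploder.net` URLs.

```python
import shlex
from typing import List


def hosts_line(ip_address: str, hostname: str) -> str:
    """Format one hosts-file entry mapping hostname to ip_address."""
    return f"{ip_address}\t{hostname}"


def wget_warc_command(start_url: str, warc_name: str, domains: List[str]) -> List[str]:
    """Build a Wget mirror command that also writes a WARC file."""
    return [
        "wget",
        "--mirror",
        "--page-requisites",
        "--span-hosts",                      # follow links onto the dead host name
        f"--domains={','.join(domains)}",    # but stay within these hosts
        f"--warc-file={warc_name}",
        start_url,
    ]


if __name__ == "__main__":
    # 203.0.113.10 is a placeholder; look up the real IP of us.xploder.net first.
    print(hosts_line("203.0.113.10", "updates.xploder.net"))
    command = wget_warc_command(
        "http://us.xploder.net/nintendo-ds/",
        "xploder",
        ["us.xploder.net", "updates.xploder.net"],
    )
    print(" ".join(shlex.quote(part) for part in command))
```

The script only prints the entry and the command; editing the hosts file and running Wget are left as manual steps.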
21:05 powerKitt Once I get the warc, what's the proper way to upload it to http://www.archive.org/ again?
21:05 xmc make an item in community texts, upload the warc into the item, set mediatype web if you can
21:05 xmc then get someone's attention here so it can be moved and blessed correctly, things that you can't do without the right admin privileges
21:05 xmc http://archive.org/upload/
21:06 powerKitt Alright.
21:12 Start has quit IRC (Read error: Connection reset by peer)
21:13 Start has joined #archiveteam-bs
21:56 BlueMaxim has joined #archiveteam-bs
22:29 RichardG_ has joined #archiveteam-bs
22:29 RichardG has quit IRC (Ping timeout: 370 seconds)
22:54 powerKitt has quit IRC (Ping timeout: 268 seconds)
23:16 ranma wat
23:16 ranma CenturyLink to Buy Level 3 for $34B http://hn.premii.com/#/article/12835826
23:16 ranma (hackernews)
23:17 JW_work ick…
23:18 ranma yeah :x
23:19 Yoshimura hm
23:20 Yoshimura ... just yet another big consolidation, predicted many many years ago, the trend, that is.
23:27 RichardG_ has quit IRC (Read error: Connection reset by peer)
23:27 RichardG has joined #archiveteam-bs
23:48 godane i'm uploading www.abc.net.au/news/2013 urls
