#archiveteam 2016-07-24,Sun

Time Nickname Message
00:33 🔗 godane has joined #archiveteam
00:43 🔗 arkiver SketchCow: I'd like to start writing a project for tumblr
00:44 🔗 nightpool arkiver: xmc was going to start one I think
00:44 🔗 nightpool we were just talking about it in -bs
00:44 🔗 xmc i haven't gotten around to it, so if you do it first then you win
00:44 🔗 xmc i have some ideas about how to do it that might be valuable, but they're in scrollback of #-bs already
00:45 🔗 xmc it would require two projects and a tiny bit of serverside code but yipdw is willing
01:00 🔗 WinterFox has joined #archiveteam
01:07 🔗 DoomTay has joined #archiveteam
01:08 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
01:08 🔗 rossdylan has quit IRC (Read error: Operation timed out)
01:28 🔗 Coderjoe has joined #archiveteam
01:50 🔗 godane has quit IRC (Read error: Operation timed out)
02:25 🔗 philpem has quit IRC (Ping timeout: 260 seconds)
02:36 🔗 Aranje has quit IRC (Ping timeout: 260 seconds)
03:49 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
04:12 🔗 Aranje has joined #archiveteam
04:35 🔗 Coderjoe has joined #archiveteam
04:46 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:52 🔗 Sk1d has joined #archiveteam
05:04 🔗 TC02 has quit IRC (Ping timeout: 246 seconds)
05:27 🔗 TC02 has joined #archiveteam
05:31 🔗 DoomTay has quit IRC (Quit: Page closed)
05:31 🔗 n00b484 has joined #archiveteam
05:32 🔗 n00b484 it seems to be working now but can't connect using mIRC
05:35 🔗 n00b484 has quit IRC (Client Quit)
05:40 🔗 TC01 has quit IRC (Ping timeout: 260 seconds)
05:53 🔗 yipdw has quit IRC (Read error: Operation timed out)
05:54 🔗 TC01 has joined #archiveteam
05:54 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
05:56 🔗 JesseW has joined #archiveteam
06:08 🔗 yipdw has joined #archiveteam
06:33 🔗 tomwsmf has quit IRC (Ping timeout: 258 seconds)
07:07 🔗 metal_cam has joined #archiveteam
07:13 🔗 Emcy has quit IRC (Read error: Operation timed out)
07:14 🔗 TC02 has quit IRC (Ping timeout: 246 seconds)
07:21 🔗 TC02 has joined #archiveteam
07:30 🔗 TC02 has quit IRC (Ping timeout: 246 seconds)
07:31 🔗 TC02 has joined #archiveteam
07:31 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
08:32 🔗 W1nterFox has joined #archiveteam
08:35 🔗 WinterFox has quit IRC (Ping timeout: 492 seconds)
09:12 🔗 philpem has joined #archiveteam
09:16 🔗 schbirid has joined #archiveteam
09:48 🔗 pfallenop has quit IRC (Ping timeout: 260 seconds)
09:54 🔗 arkiver xmc: I'll have a look at the logs of #-bs. I'm not sure how spread out it is over the logs, so if you are around, maybe you could give me a small overview of the idea?
09:54 🔗 arkiver I'm basically thinking a warrior project to create the WARCs for the wayback machine.
09:54 🔗 arkiver We'll be extracting new tumblr sites as we archive them
10:00 🔗 arkiver I see the main idea is to separate archiving images and blogs
10:02 🔗 arkiver I've started a little on the warrior project now
10:04 🔗 arkiver We should be able to do some test runs soon
10:08 🔗 arkiver xmc: what would be the reason for separating the grab of images from the other files?
10:13 🔗 Scuttle has joined #archiveteam
10:13 🔗 Scuttle hum...I have an archivebot pulling data from one of my sites, how do I find out what's going on? :)
10:14 🔗 arkiver hi
10:14 🔗 GLaDOS has quit IRC (Read error: Connection reset by peer)
10:14 🔗 arkiver What is your site?
10:14 🔗 GLaDOS has joined #archiveteam
10:15 🔗 arkiver The dashboard of ArchiveBot can be found here http://archivebot.com/
10:17 🔗 Scuttle randomwaffle.gbs.fm
10:17 🔗 arkiver Looks like it's on the dashboard
10:17 🔗 Scuttle right
10:18 🔗 Scuttle if someone wants, I can rsync the whole site somewhere
10:18 🔗 arkiver not sure what the grab is doing now though
10:18 🔗 Scuttle well, downloading everything it seems :)
10:19 🔗 pfallenop has joined #archiveteam
10:19 🔗 Igloo Yeah it's going to put them onto the internet archive
10:19 🔗 Igloo Notes say you're the last surviving waffleimages mirror?
10:19 🔗 Scuttle may very well be
10:19 🔗 Igloo How big is the repo?
10:19 🔗 Scuttle around 330 gigs
10:20 🔗 Igloo We can delay the crawl / make it less resource intensive if it's causing you problems
10:21 🔗 Scuttle ah, that's no problem, I just noticed my access-logs were a lot bigger than they used to be :D
10:21 🔗 Igloo aha :)
10:21 🔗 Igloo Seems someone wants to preserve it forever
10:21 🔗 Igloo So it got added to the crawlers to download & upload to the Internet Archive / viewable in the Wayback Machine
10:21 🔗 terg has joined #archiveteam
10:21 🔗 Scuttle aight
10:22 🔗 Scuttle it's mostly forum-linked pics though I think...
10:22 🔗 Igloo 182Gb done so about half way
10:22 🔗 Scuttle and that would be broken anyways since I don't have access to the waffleimages-domain
10:22 🔗 Igloo No other notes :-/
10:24 🔗 terg post the KAT raid, apart from proxies and such, is there any database lying about of KAT torrent info?
10:24 🔗 Igloo There is a torrent (ironically) kicking around somewhere
10:25 🔗 terg very unfortunate, I wonder if it'd be a good idea to do regular (incremental if possible) archivals of large torrent indexes
10:25 🔗 terg whereabouts should I look?
10:25 🔗 arkiver I think that is a good idea
10:26 🔗 arkiver I'm planning on getting something going to go by all torrent sites
10:26 🔗 arkiver we already have a good archive of rutracker
10:26 🔗 Igloo Good idea, Scale is an issue
10:26 🔗 arkiver But let's move this to #archiveteam-bs
10:26 🔗 terg gotcha
11:04 🔗 Atom-- has quit IRC (Read error: Operation timed out)
11:04 🔗 winterfox has joined #archiveteam
11:05 🔗 Emcy has joined #archiveteam
11:06 🔗 W1nterFox has quit IRC (Ping timeout: 492 seconds)
11:07 🔗 Emcy has quit IRC (Client Quit)
11:30 🔗 Emcy has joined #archiveteam
11:57 🔗 Emcy_ has joined #archiveteam
12:05 🔗 Sanqui has left .
12:05 🔗 Sanqui has joined #archiveteam
12:06 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
12:06 🔗 terg has quit IRC (My Mac has gone to sleep. ZZZzzz…)
12:10 🔗 Emcy has quit IRC (Read error: Operation timed out)
12:16 🔗 Coderjoe has joined #archiveteam
12:32 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
12:35 🔗 BartoCH has joined #archiveteam
12:49 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
12:49 🔗 BlueMaxim has joined #archiveteam
13:06 🔗 atomotic has joined #archiveteam
13:13 🔗 kristian_ has joined #archiveteam
13:20 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
13:23 🔗 Coderjoe has joined #archiveteam
13:29 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
13:40 🔗 vOYtEC has joined #archiveteam
13:49 🔗 GLaDOS has quit IRC (Quit: Oh crap, I died.)
13:49 🔗 GLaDOS has joined #archiveteam
13:56 🔗 REiN^ has quit IRC (Ping timeout: 244 seconds)
13:58 🔗 redlob has quit IRC (ZNC - http://znc.in)
14:03 🔗 redlob has joined #archiveteam
14:14 🔗 BlueMaxim has quit IRC (Quit: Leaving)
14:15 🔗 REiN^ has joined #archiveteam
15:05 🔗 DoomTay has joined #archiveteam
15:30 🔗 xmc arkiver: split the images off because otherwise you'll get all the images copied into every blog's warc. that will multiply your grab size by like fifty
15:51 🔗 arkiver IA is currently working on something to deduplicate WARCs.
15:51 🔗 arkiver Duplicate records will be replaced by revisit records
15:51 🔗 arkiver I'll ask around, but size might not matter too much
15:52 🔗 arkiver Bandwidth is more of a problem
15:52 🔗 Kazzy Jumping in on this - I'm assuming that means IA will only keep 1 copy of everything, even if the same file is uploaded in every WARC?
15:53 🔗 arkiver If the same file is uploaded in 50 WARCs, 49 WARCs will have revisit records and 1 WARC will hold the actual file
15:53 🔗 arkiver (If I understood IA's idea correctly)
15:53 🔗 Kazzy revisit records being just a pointer to the actual file?
15:53 🔗 arkiver Yeah
15:54 🔗 DoomTay Does this mean no more cases of multiple timestamps for things that haven't changed at all?
15:54 🔗 Kazzy awesome, had always wondered if there was an easy way to do that, though it sounds like a ton of processing work. Makes sense for it to be done on IA's end really
15:54 🔗 arkiver As far as I know, just a redirect to another record, without making the URL and timestamp in the Wayback Machine look like it is redirected
15:54 🔗 arkiver DoomTay: no, see above ^
15:55 🔗 arkiver The idea isn't totally clear yet though, still being discussed, so things might change
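
The "1 WARC holds the file, 49 WARCs hold revisit records" idea above can be sketched roughly as follows. This is only an illustration under assumptions, not IA's actual design (which, as noted, was still being discussed): `payload_digest` and `plan_records` are hypothetical helper names, the input is a plain list of `(warc_name, url, payload)` tuples rather than real WARC files, and duplicates are detected by a base32 SHA-1 digest of the payload.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a base32-encoded SHA-1 digest of the payload, prefixed with 'sha1:'."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def plan_records(captures):
    """Decide, per capture, whether to store the full payload or a revisit pointer.

    `captures` is an iterable of (warc_name, url, payload) tuples (hypothetical
    input format). The first capture of a given payload becomes a full
    'response' record; every later capture with the same digest becomes a
    'revisit' record pointing back at the first one.
    """
    first_seen = {}  # digest -> (warc_name, url) of the record holding the payload
    plan = []
    for warc_name, url, payload in captures:
        digest = payload_digest(payload)
        if digest in first_seen:
            plan.append({
                "warc": warc_name,
                "url": url,
                "type": "revisit",
                "refers_to": first_seen[digest],
                "stored_bytes": 0,  # payload is not stored again
            })
        else:
            first_seen[digest] = (warc_name, url)
            plan.append({
                "warc": warc_name,
                "url": url,
                "type": "response",
                "refers_to": None,
                "stored_bytes": len(payload),
            })
    return plan
```

Feeding this the same image captured in 50 different blog WARCs gives one `response` entry and 49 `revisit` entries, so the payload bytes are counted once instead of fifty times. Real revisit records still carry headers, so the saving is slightly less than the full payload size per duplicate.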
16:00 🔗 DoomTay Okay, so after looking at what a revisit record is, it looks like it IS some form of redundancy removal
16:00 🔗 DoomTay Yay
16:01 🔗 DoomTay Though I doubt this would save more than, say, a few gigs worth of filespace
16:08 🔗 Kazzy huh
16:09 🔗 Kazzy Replacing a whole file and just throwing a pointer in saves tons
16:09 🔗 Kazzy Even with just AT's stuff, there's an absolute TON of duplication, due to the nature of what we do
16:09 🔗 Kazzy When you scale that up to IA, that's terabytes, at least
16:11 🔗 DoomTay Hell, maybe petabytes
16:12 🔗 DoomTay Speaking of files, anyone know how copies with matching digests can still have different lengths? Is that actually the length of the WARC?
16:12 🔗 DoomTay Like with http://web.archive.org/cdx/search/cdx?url=http://www.doomworld.com/batman/main.JPG&output=json
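
One way to look at the question above is to group the rows of a CDX query with `output=json` by digest and list the lengths seen for each. The sketch below assumes the usual layout of that output (a header row followed by data rows, with `digest` and `length` columns). As far as I know, `length` there is the size of the compressed record inside the WARC, headers included, rather than the size of the file itself, which would explain identical digests with different lengths. `lengths_by_digest` is a hypothetical helper name and `SAMPLE_ROWS` is made-up data, not the real result for the doomworld URL.

```python
from collections import defaultdict

# Made-up rows in the shape of a CDX `output=json` response (header row first).
SAMPLE_ROWS = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/main.jpg", "20010101000000", "http://example.com/main.JPG",
     "image/jpeg", "200", "DIGESTAAAA", "100"],
    ["com,example)/main.jpg", "20050101000000", "http://example.com/main.JPG",
     "image/jpeg", "200", "DIGESTAAAA", "120"],
    ["com,example)/main.jpg", "20100101000000", "http://example.com/main.JPG",
     "image/jpeg", "200", "DIGESTBBBB", "300"],
]


def lengths_by_digest(cdx_rows):
    """Map each payload digest to the sorted list of distinct lengths reported for it."""
    header, *rows = cdx_rows
    digest_col = header.index("digest")
    length_col = header.index("length")
    groups = defaultdict(set)
    for row in rows:
        groups[row[digest_col]].add(int(row[length_col]))
    return {digest: sorted(lengths) for digest, lengths in groups.items()}


if __name__ == "__main__":
    for digest, lengths in lengths_by_digest(SAMPLE_ROWS).items():
        print(digest, lengths)
```

A digest that maps to more than one length is a capture whose payload was identical but whose stored record differed in size, for example because the response headers or compression differed between crawls.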
16:15 🔗 schbirid wget has a dedup flag for warc btw
16:16 🔗 Kazzy that only goes so far though schbirid, I guess that works for ArchiveBot, but not warrior projects
17:02 🔗 kristian_ has quit IRC (Leaving)
17:29 🔗 Scuttle has left Leaving
17:50 🔗 JesseW has joined #archiveteam
18:12 🔗 tomwsmf has joined #archiveteam
18:38 🔗 metal_cam is now known as metalcamp
19:05 🔗 Start has quit IRC (Read error: Connection reset by peer)
19:05 🔗 Start has joined #archiveteam
19:16 🔗 schbirid no idea what it means but http://ddl-warez.to/ has a notice "only 68 days left"
19:17 🔗 DoomTay ....and it has freaking CloudFlare
19:19 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
19:21 🔗 schbirid of course
20:09 🔗 maseck has quit IRC (Quit: No Ping reply in 180 seconds.)
20:09 🔗 maseck has joined #archiveteam
20:50 🔗 Medowar ...and there we go http://www.bbc.com/news/business-36879831
20:50 🔗 Medowar .title
20:51 🔗 kristian_ has joined #archiveteam
20:51 🔗 Kazzy .title http://www.bbc.co.uk/news/business-36879831
20:51 🔗 Kazzy sod it, "Verizon 'agrees $5bn Yahoo deal'"
21:07 🔗 HCross2 Oh God. Yahoo and AOL having a baby
21:08 🔗 Kazzy is now known as Kaz
21:10 🔗 DoomTay This is gonna be fun...
21:26 🔗 Actium has joined #archiveteam
21:52 🔗 godane has joined #archiveteam
21:54 🔗 Nemo_bis Supercookies for everyone!
21:54 🔗 Emcy_ has quit IRC (Read error: Operation timed out)
21:55 🔗 Emcy_ has joined #archiveteam
22:00 🔗 pguth_ has quit IRC (Remote host closed the connection)
22:00 🔗 pguth_ has joined #archiveteam
22:02 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
22:02 🔗 Emcy_ has quit IRC (Read error: Operation timed out)
22:04 🔗 Emcy_ has joined #archiveteam
22:10 🔗 kristian_ has quit IRC (Leaving)
22:15 🔗 winterfox has quit IRC (Ping timeout: 492 seconds)
22:33 🔗 ndiddy has joined #archiveteam
22:37 🔗 Coderjoe has quit IRC (Ping timeout: 260 seconds)
22:38 🔗 Coderjoe has joined #archiveteam
22:51 🔗 dashcloud has joined #archiveteam
23:01 🔗 pguth_ has quit IRC (Remote host closed the connection)
23:01 🔗 pguth_ has joined #archiveteam
23:11 🔗 kristian_ has joined #archiveteam
23:19 🔗 Coderjoe has quit IRC (Ping timeout: 260 seconds)
23:22 🔗 Swaxx has joined #archiveteam
23:23 🔗 Swaxx hi anyone here?
23:23 🔗 * Actium says hi and goes back into hiding
23:24 🔗 Swaxx how can i post a link in a forumpost?
23:25 🔗 DoomTay I don't think this is the place for that
23:25 🔗 Swaxx ow okay,
23:25 🔗 Swaxx do you know the irc address to extratorrents?
23:28 🔗 Swaxx has quit IRC (Quit: Page closed)
23:34 🔗 Coderjoe has joined #archiveteam
23:37 🔗 BlueMaxim has joined #archiveteam
23:55 🔗 closure has joined #archiveteam
