#archiveteam 2016-07-02,Sat


Time Nickname Message
00:38 🔗 wp494 I'm seeing plenty of people working on or having already finished mirrors of louis' channel, but that said, the more copies, the better
00:41 🔗 DoomTay I know one guy did everything, then it turned out his copies were of inferior quality
00:47 🔗 Fake-Name has joined #archiveteam
01:02 🔗 Fake-Nam1 has joined #archiveteam
01:02 🔗 Fake-Name has quit IRC (Read error: Operation timed out)
01:03 🔗 ris has quit IRC ()
01:09 🔗 j08nY has quit IRC (Quit: Leaving)
01:16 🔗 Fake-Nam1 has quit IRC (Ping timeout: 250 seconds)
01:18 🔗 Fake-Name has joined #archiveteam
01:33 🔗 Fusl has quit IRC (Max SendQ exceeded)
01:44 🔗 Fusl has joined #archiveteam
01:53 🔗 Stilett0 has joined #archiveteam
01:54 🔗 pfallenop has quit IRC (Ping timeout: 260 seconds)
01:55 🔗 Stiletto has quit IRC (Read error: Operation timed out)
02:07 🔗 WinterFox has joined #archiveteam
02:15 🔗 Ungstein2 has joined #archiveteam
02:15 🔗 Ungstein2 has quit IRC (Connection closed)
02:28 🔗 DoomTay has quit IRC (Ping timeout: 268 seconds)
02:32 🔗 DoomTay has joined #archiveteam
02:53 🔗 pfallenop has joined #archiveteam
03:30 🔗 Fake-Name has quit IRC (Read error: Operation timed out)
03:31 🔗 Fake-Name has joined #archiveteam
03:43 🔗 ploop has quit IRC (ZNC - 1.6.0 - http://znc.in)
03:51 🔗 vitzli has joined #archiveteam
03:58 🔗 fie_ has joined #archiveteam
04:01 🔗 fie has quit IRC (Ping timeout: 244 seconds)
04:08 🔗 vitzli has quit IRC (Quit: Leaving)
04:24 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
04:33 🔗 ndiddy has quit IRC (Read error: Connection reset by peer)
04:39 🔗 VADemon has joined #archiveteam
04:41 🔗 DoomTay has quit IRC (Quit: Page closed)
04:49 🔗 ploop has joined #archiveteam
04:50 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:52 🔗 Fake-Name has quit IRC (Read error: Operation timed out)
04:55 🔗 galaxy_an has quit IRC (Ping timeout: 260 seconds)
04:56 🔗 ploop has quit IRC (Remote host closed the connection)
04:56 🔗 Sk1d has joined #archiveteam
04:56 🔗 Sk1d has quit IRC (Connection closed)
04:58 🔗 Sk1d has joined #archiveteam
05:04 🔗 Fake-Name has joined #archiveteam
05:12 🔗 dashcloud has quit IRC (Read error: Operation timed out)
05:15 🔗 dashcloud has joined #archiveteam
05:23 🔗 ploop has joined #archiveteam
05:23 🔗 ploop has quit IRC (Remote host closed the connection)
05:30 🔗 ploop has joined #archiveteam
05:59 🔗 ploop has quit IRC (Quit: ZNC - 1.6.0 - http://znc.in)
06:02 🔗 RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
06:03 🔗 ploop has joined #archiveteam
06:08 🔗 ravetcofx has quit IRC (Ping timeout: 506 seconds)
06:10 🔗 tomwsmf-a has quit IRC (Read error: Operation timed out)
06:14 🔗 ploop has quit IRC (Quit: ZNC - 1.6.0 - http://znc.in)
06:15 🔗 ravetcofx has joined #archiveteam
06:21 🔗 ploop has joined #archiveteam
06:22 🔗 RichardG has joined #archiveteam
06:56 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
08:13 🔗 VADemon has quit IRC (Quit: left4dead)
08:16 🔗 bzc6p has joined #archiveteam
08:16 🔗 swebb sets mode: +o bzc6p
08:17 🔗 bzc6p So we need a Warrior project for dnshistory.org (closing July 10). I was able to assemble a discovery script last night, but won't have time in the following days to write scripts for grabbing, so I leave some information here for a potential project manager.
08:18 🔗 bzc6p For every TLD (1,365) the site lists known domain names, 50 domains/page. My script is finding out how many pages of domains there are for each TLD.
08:18 🔗 bzc6p My suggestion for an item could be one or more pages of a TLD, then it can be labelled well, like com:3001:3010
08:18 🔗 bzc6p com:3001-3010
08:20 🔗 bzc6p Then those pages, the pages of those domains, and their subpages (e.g. record history, subdomains etc.) can be grabbed.
08:22 🔗 bzc6p It is said that one can access a page from the previous page (e.g. the 323rd page from the 322nd), in wget probably with a referer. This seems to be true, with some exceptions: sometimes it works without the referer, sometimes it doesn't work even with the referer, and sometimes the site forces you to go one by one (you can't say "I need page 1000, trust me, I come from page 999").
08:22 🔗 bzc6p These problems seem to occur only for big TLDs (com, info etc.)
08:23 🔗 bzc6p So I'm doing the discovery. It makes good progress (except for the big domains, they take a lot of time), but I can already provide partial results if needed for items, and probably almost done by this evening (except for some big TLDs)
08:23 🔗 bzc6p Project channel #greatlookup
08:25 🔗 bzc6p If it was a week later, I could write the grabber script myself, but unfortunately it is not.
08:25 🔗 bzc6p -- End of message
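A minimal sketch of the item scheme bzc6p suggests above, assuming the per-TLD page count from discovery is already known. The function names and the default of 10 pages per item are illustrative assumptions, not part of any existing project script; only the label shape (com:3001-3010) comes from the discussion.

```python
def make_items(tld, total_pages, pages_per_item=10):
    """Split pages 1..total_pages of a TLD into labels like 'com:3001-3010'.

    pages_per_item is a hypothetical default; the log only says an item
    could be "one or more pages of a TLD".
    """
    items = []
    for start in range(1, total_pages + 1, pages_per_item):
        # The last item may cover fewer pages than pages_per_item.
        end = min(start + pages_per_item - 1, total_pages)
        items.append(f"{tld}:{start}-{end}")
    return items


def parse_item(label):
    """Turn 'com:3001-3010' back into ('com', 3001, 3010).

    The TLD is split off at the first colon, so TLDs that contain
    hyphens (e.g. punycode names) still parse correctly.
    """
    tld, _, span = label.partition(":")
    first, _, last = span.partition("-")
    return tld, int(first), int(last)
```

A grabber could then walk the pages of one item in order, which fits the observation that the site sometimes only serves a page to a client coming from the previous one.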
08:25 🔗 * bzc6p gotta go do his chores
08:30 🔗 joepie91 bzc6p: I've only seen the referer problem occur beyond page 100k or so
08:34 🔗 bzc6p has left
09:39 🔗 bzc6p has joined #archiveteam
09:39 🔗 swebb sets mode: +o bzc6p
09:40 🔗 bzc6p Discovery for all TLDs done except for com, net, org, biz, xyz, info
09:40 🔗 bzc6p These need some more time.
09:40 🔗 bzc6p has left
10:00 🔗 Jeroen52 has quit IRC (Ping timeout: 260 seconds)
10:15 🔗 Jeroen52 has joined #archiveteam
10:37 🔗 dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
10:44 🔗 dashcloud has joined #archiveteam
11:37 🔗 dashcloud has quit IRC (Read error: Operation timed out)
11:41 🔗 dashcloud has joined #archiveteam
11:44 🔗 arkiver I'm currently working on a warrior project for dnshistory
11:44 🔗 arkiver We'll use the pages as items
11:45 🔗 arkiver So I don't think we need a discovery to get a list of domains
11:45 🔗 arkiver We just need to know the number of pages
12:08 🔗 BlueMaxim has quit IRC (Quit: Leaving)
12:19 🔗 joepie91 arkiver: we don't, and probably can't
12:24 🔗 arkiver joepie91: sure we can
12:25 🔗 arkiver (the number of pages)
12:39 🔗 j08nY has joined #archiveteam
12:59 🔗 ndiddy has joined #archiveteam
13:32 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
13:38 🔗 BartoCH has joined #archiveteam
14:18 🔗 Fake-Name has quit IRC (Ping timeout: 260 seconds)
14:24 🔗 hive-mind has quit IRC (Remote host closed the connection)
14:26 🔗 hive-mind has joined #archiveteam
14:29 🔗 WinterFox has quit IRC (Read error: Operation timed out)
14:48 🔗 metalcamp has joined #archiveteam
14:54 🔗 Smiley has joined #archiveteam
14:56 🔗 SmileyG has quit IRC (Read error: Operation timed out)
14:56 🔗 j08nY has quit IRC (Read error: Operation timed out)
14:57 🔗 TC01_ has joined #archiveteam
14:58 🔗 d_rebel has quit IRC (Read error: Operation timed out)
14:58 🔗 cadbury_ has quit IRC (Read error: Connection reset by peer)
14:59 🔗 d_rebel has joined #archiveteam
15:00 🔗 brayden has quit IRC (Read error: Operation timed out)
15:02 🔗 Baljem_ has joined #archiveteam
15:02 🔗 Baljem has quit IRC (Read error: Connection reset by peer)
15:03 🔗 maseck has quit IRC (Ping timeout: 633 seconds)
15:03 🔗 TC01 has quit IRC (Ping timeout: 633 seconds)
15:03 🔗 blblblbl has quit IRC (Read error: Connection reset by peer)
15:04 🔗 maseck has joined #archiveteam
15:08 🔗 xmc has quit IRC (Ping timeout: 633 seconds)
15:08 🔗 jch has joined #archiveteam
15:09 🔗 dashcloud has quit IRC (Ping timeout: 633 seconds)
15:13 🔗 dxrt- has quit IRC (Ping timeout: 633 seconds)
15:15 🔗 cadbury_ has joined #archiveteam
15:16 🔗 DoomTay has joined #archiveteam
15:16 🔗 jch has quit IRC (Read error: Connection reset by peer)
15:20 🔗 aschmitz has quit IRC (Excess Flood)
15:21 🔗 jch has joined #archiveteam
15:21 🔗 aschmitz has joined #archiveteam
15:26 🔗 xmc has joined #archiveteam
15:26 🔗 swebb sets mode: +o xmc
15:36 🔗 dashcloud has joined #archiveteam
15:42 🔗 VADemon has joined #archiveteam
15:42 🔗 brayden has joined #archiveteam
15:42 🔗 swebb sets mode: +o brayden
15:42 🔗 JesseW has joined #archiveteam
16:08 🔗 xmc has quit IRC (Read error: Operation timed out)
16:12 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
16:13 🔗 brayden has quit IRC (Read error: Operation timed out)
16:15 🔗 cadbury_ has quit IRC (Read error: Operation timed out)
16:22 🔗 jmad980 has quit IRC (Ping timeout: 244 seconds)
16:24 🔗 jch has quit IRC (Read error: Connection reset by peer)
16:26 🔗 cadbury_ has joined #archiveteam
16:26 🔗 xmc has joined #archiveteam
16:26 🔗 swebb sets mode: +o xmc
16:28 🔗 jch has joined #archiveteam
16:59 🔗 jmad980 has joined #archiveteam
17:02 🔗 Fake-Name has joined #archiveteam
17:30 🔗 JesseW has joined #archiveteam
17:42 🔗 DoomTay has quit IRC (Quit: Page closed)
17:44 🔗 jmad980 has quit IRC (Read error: Operation timed out)
17:46 🔗 cadbury_ has quit IRC (Read error: Operation timed out)
17:51 🔗 cadbury_ has joined #archiveteam
17:59 🔗 jmad980 has joined #archiveteam
18:12 🔗 dashcloud has quit IRC (Read error: Operation timed out)
18:16 🔗 dashcloud has joined #archiveteam
18:18 🔗 dashcloud has quit IRC (Read error: Operation timed out)
18:22 🔗 dashcloud has joined #archiveteam
18:40 🔗 yipdw has quit IRC (Read error: Operation timed out)
18:46 🔗 yipdw has joined #archiveteam
18:48 🔗 arkiver2 has joined #archiveteam
18:48 🔗 swebb sets mode: +o arkiver2
18:52 🔗 arkiver2 has quit IRC (Remote host closed the connection)
19:03 🔗 Tomcat_ has joined #archiveteam
19:08 🔗 Tomcat_ has quit IRC (Remote host closed the connection)
19:38 🔗 DoomTay has joined #archiveteam
20:00 🔗 tomwsmf-a has joined #archiveteam
20:02 🔗 DoomTay has quit IRC (Quit: Page closed)
20:04 🔗 Froggypwn has quit IRC (Read error: Connection reset by peer)
20:20 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
20:25 🔗 Froggypwn has joined #archiveteam
20:25 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
21:25 🔗 dashcloud has quit IRC (Read error: Operation timed out)
21:28 🔗 kristian_ has joined #archiveteam
21:28 🔗 dashcloud has joined #archiveteam
22:01 🔗 kristian_ has quit IRC (Leaving)
22:25 🔗 wyatt8740 has joined #archiveteam
22:25 🔗 philpem has joined #archiveteam
22:25 🔗 wyatt8740 well, my C program for parsing warc's is quite probably the most horrifically bad C I've ever written
22:25 🔗 wyatt8740 but so far it's parsing it :D
22:27 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:27 🔗 yipdw if you want to make sure you didn't miss any buffer overflows there is a vast corpus available
22:27 🔗 wyatt8740 it looked beautiful, but then I hit a bug trying to use fseeko() and ended up ruining my legibility
22:27 🔗 arkiver What does it do exactly?
22:27 🔗 wyatt8740 I've been very careful with my mallocs and pointers now :P
22:27 🔗 wyatt8740 extracts files from a warc
22:27 🔗 wyatt8740 in C
22:27 🔗 yipdw there was some kerfuffle about "why there are no C libraries for WARC" and the short version is that string operations in C are horrible
22:28 🔗 wyatt8740 there's not even a C++ one
22:28 🔗 wyatt8740 that's the shocker
22:28 🔗 ravetcofx has quit IRC (Remote host closed the connection)
22:28 🔗 yipdw string operations in C++ are also horrible
22:28 🔗 wyatt8740 and java ones aren't? -_-
22:28 🔗 Frogging at least in C++ you get inbuilt dynamic strings :p
22:28 🔗 wyatt8740 ^
22:28 🔗 yipdw I don't know what Java has to do with this
22:28 🔗 wyatt8740 a java WARC library exists
22:28 🔗 yipdw ok
22:28 🔗 yipdw so someone decided to write one, that's good
22:29 🔗 wyatt8740 ...using maven
22:29 🔗 Frogging far more Python ones though
22:29 🔗 wyatt8740 so not something nice and simple
22:29 🔗 wyatt8740 I was trying to parse a WARC from my android phone, so python wasn't really a good option
22:29 🔗 yipdw k
22:31 🔗 wyatt8740 anyway, my code turned to spaghetti while I was ironing out a bug
22:31 🔗 wyatt8740 but it's working
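For readers unfamiliar with the format being discussed, here is a rough sketch of what "extracting files from a WARC" involves. It is not wyatt8740's C program (which is not shown in the log) but a small Python illustration that assumes an uncompressed WARC file with no folded header lines: each record is a `WARC/` version line, named header fields, a blank line, then exactly `Content-Length` bytes of content. The function name is hypothetical.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Assumptions: the stream is opened in binary mode, records are not
    gzip-compressed, and header lines are not folded across lines.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The payload length comes from the header, never from scanning
        # for a delimiter, because the payload may itself contain CRLFs.
        length = int(headers["content-length"])
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated record")
        yield headers, payload
```

Reading by `Content-Length` rather than searching for a terminator is what keeps the string handling manageable; for response records the payload still contains the HTTP headers, which would need a second parsing step before a file could be written out.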
22:31 🔗 dashcloud has joined #archiveteam
22:33 🔗 ravetcofx has joined #archiveteam
22:42 🔗 arkiver chfoo: can you please create a target for 'thomas' on FOS?
22:42 🔗 arkiver SketchCow: we're going to do a little project on http://thomas.loc.gov/home/thomas.php
22:42 🔗 arkiver It's going away on the 5th
22:52 🔗 Frogging awesome
22:52 🔗 Frogging the project bit, not the going away bit
22:52 🔗 arkiver :)
22:57 🔗 zino If you need one quick I happen to be awake and have a shell ready on eldrimner. :)
22:57 🔗 zino arkiver ^
22:57 🔗 arkiver awesome
22:58 🔗 arkiver oh, I'll PM you on what to do with the data from coursera
22:58 🔗 zino I'll have a target in 2min
22:58 🔗 zino OK
22:58 🔗 arkiver I won't have the scripts ready yet though in 2 minutes
22:59 🔗 zino I'll be awake for another 30min or so.
22:59 🔗 arkiver ok
22:59 🔗 Frogging code faster or the robots will eat you
23:00 🔗 arkiver nooooooo
23:00 🔗 arkiver lol
23:00 🔗 zino thomas target up on eldrimner.lysator.liu.se
23:00 🔗 arkiver awesome
23:00 🔗 arkiver thanks!
23:01 🔗 bwn you need old glory robot insurance
23:08 🔗 VADemon has quit IRC (Quit: left4dead)
23:10 🔗 antomati_ has quit IRC (Read error: Connection reset by peer)
23:11 🔗 antomatic has joined #archiveteam
23:11 🔗 swebb sets mode: +o antomatic
23:12 🔗 arkiver would anyone have a target with a little bit of space for some discovery files for thomas?
23:12 🔗 Emcy_ has joined #archiveteam
23:13 🔗 luckcolor i have
23:13 🔗 luckcolor it's ssd
23:14 🔗 luckcolor how much space do you need
23:14 🔗 luckcolor if it's in the range of 30gb i can manage
23:14 🔗 luckcolor arkiver:
23:15 🔗 arkiver it's in the range of a few MB probably
23:15 🔗 arkiver please PM me the target
23:15 🔗 arkiver rsync
23:15 🔗 luckcolor ok hold on a sec
23:16 🔗 arkiver thaks
23:16 🔗 arkiver thanks*
23:22 🔗 Emcy has quit IRC (Read error: Operation timed out)
23:28 🔗 Emcy has joined #archiveteam
23:29 🔗 antomatic has quit IRC (Read error: Connection reset by peer)
23:29 🔗 lytv has joined #archiveteam
23:29 🔗 Smiley has quit IRC (Remote host closed the connection)
23:29 🔗 antomatic has joined #archiveteam
23:29 🔗 swebb sets mode: +o antomatic
23:32 🔗 vtyl has quit IRC (Read error: Operation timed out)
23:34 🔗 Emcy_ has quit IRC (Read error: Operation timed out)
23:43 🔗 Smiley has joined #archiveteam
23:49 🔗 chfoo arkiver, done
23:49 🔗 arkiver thanks!
23:49 🔗 arkiver zino: for now we'll be using FOS, if needed I'll use your target too
