Time |
Nickname |
Message |
00:38
🔗
|
wp494 |
I'm seeing plenty of people working on or having already finished mirrors of louis' channel, but that said, the more copies, the better |
00:41
🔗
|
DoomTay |
I know one guy did everything, then it turned out his copies were of inferior quality |
00:47
🔗
|
|
Fake-Name has joined #archiveteam |
01:02
🔗
|
|
Fake-Nam1 has joined #archiveteam |
01:02
🔗
|
|
Fake-Name has quit IRC (Read error: Operation timed out) |
01:03
🔗
|
|
ris has quit IRC () |
01:09
🔗
|
|
j08nY has quit IRC (Quit: Leaving) |
01:16
🔗
|
|
Fake-Nam1 has quit IRC (Ping timeout: 250 seconds) |
01:18
🔗
|
|
Fake-Name has joined #archiveteam |
01:33
🔗
|
|
Fusl has quit IRC (Max SendQ exceeded) |
01:44
🔗
|
|
Fusl has joined #archiveteam |
01:53
🔗
|
|
Stilett0 has joined #archiveteam |
01:54
🔗
|
|
pfallenop has quit IRC (Ping timeout: 260 seconds) |
01:55
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
02:07
🔗
|
|
WinterFox has joined #archiveteam |
02:15
🔗
|
|
Ungstein2 has joined #archiveteam |
02:15
🔗
|
|
Ungstein2 has quit IRC (Connection closed) |
02:28
🔗
|
|
DoomTay has quit IRC (Ping timeout: 268 seconds) |
02:32
🔗
|
|
DoomTay has joined #archiveteam |
02:53
🔗
|
|
pfallenop has joined #archiveteam |
03:30
🔗
|
|
Fake-Name has quit IRC (Read error: Operation timed out) |
03:31
🔗
|
|
Fake-Name has joined #archiveteam |
03:43
🔗
|
|
ploop has quit IRC (ZNC - 1.6.0 - http://znc.in) |
03:51
🔗
|
|
vitzli has joined #archiveteam |
03:58
🔗
|
|
fie_ has joined #archiveteam |
04:01
🔗
|
|
fie has quit IRC (Ping timeout: 244 seconds) |
04:08
🔗
|
|
vitzli has quit IRC (Quit: Leaving) |
04:24
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
04:33
🔗
|
|
ndiddy has quit IRC (Read error: Connection reset by peer) |
04:39
🔗
|
|
VADemon has joined #archiveteam |
04:41
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
04:49
🔗
|
|
ploop has joined #archiveteam |
04:50
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
04:52
🔗
|
|
Fake-Name has quit IRC (Read error: Operation timed out) |
04:55
🔗
|
|
galaxy_an has quit IRC (Ping timeout: 260 seconds) |
04:56
🔗
|
|
ploop has quit IRC (Remote host closed the connection) |
04:56
🔗
|
|
Sk1d has joined #archiveteam |
04:56
🔗
|
|
Sk1d has quit IRC (Connection closed) |
04:58
🔗
|
|
Sk1d has joined #archiveteam |
05:04
🔗
|
|
Fake-Name has joined #archiveteam |
05:12
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
05:15
🔗
|
|
dashcloud has joined #archiveteam |
05:23
🔗
|
|
ploop has joined #archiveteam |
05:23
🔗
|
|
ploop has quit IRC (Remote host closed the connection) |
05:30
🔗
|
|
ploop has joined #archiveteam |
05:59
🔗
|
|
ploop has quit IRC (Quit: ZNC - 1.6.0 - http://znc.in) |
06:02
🔗
|
|
RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) |
06:03
🔗
|
|
ploop has joined #archiveteam |
06:08
🔗
|
|
ravetcofx has quit IRC (Ping timeout: 506 seconds) |
06:10
🔗
|
|
tomwsmf-a has quit IRC (Read error: Operation timed out) |
06:14
🔗
|
|
ploop has quit IRC (Quit: ZNC - 1.6.0 - http://znc.in) |
06:15
🔗
|
|
ravetcofx has joined #archiveteam |
06:21
🔗
|
|
ploop has joined #archiveteam |
06:22
🔗
|
|
RichardG has joined #archiveteam |
06:56
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
08:13
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
08:16
🔗
|
|
bzc6p has joined #archiveteam |
08:16
🔗
|
|
swebb sets mode: +o bzc6p |
08:17
🔗
|
bzc6p |
So we need a Warrior project for dnshistory.org (closing July 10). I was able to assemble a discovery script last night, but won't have time in the following days to write scripts for grabbing, so I leave some information here for a potential project manager. |
08:18
🔗
|
bzc6p |
For every TLD (1,365) the site lists known domain names. 50 domains/page. My script is finding out how many pages of domains there are for each TLD. |
08:18
🔗
|
bzc6p |
My suggestion for an item could be one or more pages of a TLD, then it can be labelled well, like com:3001:3010 |
08:18
🔗
|
bzc6p |
com:3001-3010 |
08:20
🔗
|
bzc6p |
Then those pages, the pages of those domains, and their subpages (e.g. record history, subdomains etc.) can be grabbed. |
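A minimal sketch of the item scheme bzc6p describes above, assuming items are contiguous page ranges of one TLD named like com:3001-3010. The function names and the 10-pages-per-item default are illustrative assumptions, not part of any actual project script; only the 50 domains/page figure and the label format come from the log.

```python
import math

DOMAINS_PER_PAGE = 50  # from the log: the site lists 50 domains per page


def page_count(domain_count):
    """Number of listing pages needed for a TLD with domain_count domains."""
    return math.ceil(domain_count / DOMAINS_PER_PAGE)


def make_items(tld, total_pages, pages_per_item=10):
    """Split pages 1..total_pages of a TLD into items named tld:first-last.

    pages_per_item is a hypothetical tuning knob, not a value from the log.
    """
    items = []
    for first in range(1, total_pages + 1, pages_per_item):
        last = min(first + pages_per_item - 1, total_pages)
        items.append(f"{tld}:{first}-{last}")
    return items


def parse_item(name):
    """Turn 'com:3001-3010' back into ('com', 3001, 3010)."""
    tld, _, page_range = name.partition(":")
    first, _, last = page_range.partition("-")
    return tld, int(first), int(last)
```

With these defaults, the 301st item of a large TLD covers pages 3001 to 3010, which matches the label in the example; a worker would call parse_item on the label it receives to learn which pages to fetch.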
08:22
🔗
|
bzc6p |
It is said that one can access a page from the previous page (e.g. 323rd page from the 322nd), in wget probably with referer. This seems to be true, with some exceptions (sometimes it works without the referer, sometimes it doesn't even with the referer, and sometimes the site forces you to go one by one, you can't say "I need page 1000, trust me I come from page 999") |
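A rough sketch of the referer idea above, assuming the site only checks that the Referer header names the previous listing page. PAGE_URL is a made-up placeholder, since the log does not give the real dnshistory.org URL layout, and the helper names are hypothetical. The code only builds requests and does not send anything.

```python
import urllib.request

# Hypothetical URL pattern; the real listing URLs are not given in the log.
PAGE_URL = "https://example.invalid/browse/{tld}/{page}"


def build_page_request(tld, page):
    """Build a request for one listing page, claiming to come from the previous page."""
    request = urllib.request.Request(PAGE_URL.format(tld=tld, page=page))
    if page > 1:
        # Page 1 has no previous page, so it gets no Referer header.
        request.add_header("Referer", PAGE_URL.format(tld=tld, page=page - 1))
    return request


def walk_pages(tld, first, last):
    """Yield requests for pages first..last in order, for when the site forces one-by-one access."""
    for page in range(first, last + 1):
        yield build_page_request(tld, page)
```

Given the exceptions mentioned above, a grabber would probably still need to retry or fall back to walking pages strictly in order for the big TLDs.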
08:22
🔗
|
bzc6p |
These problems seem to occur only for big TLDs (com, info etc.) |
08:23
🔗
|
bzc6p |
So I'm doing the discovery. It makes good progress (except for the big domains, they take a lot of time), but I can already provide partial results if needed for items, and probably almost done by this evening (except for some big TLDs) |
08:23
🔗
|
bzc6p |
Project channel #greatlookup |
08:25
🔗
|
bzc6p |
If it was a week later, I could write the grabber script myself, but unfortunately it is not. |
08:25
🔗
|
bzc6p |
-- End of message |
08:25
🔗
|
* |
bzc6p gotta go do his chores |
08:30
🔗
|
joepie91 |
bzc6p: I've only seen the referer problem occur beyond page 100k or so |
08:34
🔗
|
|
bzc6p has left |
09:39
🔗
|
|
bzc6p has joined #archiveteam |
09:39
🔗
|
|
swebb sets mode: +o bzc6p |
09:40
🔗
|
bzc6p |
Discovery for all TLDs done except for com, net, org, biz, xyz, info |
09:40
🔗
|
bzc6p |
These need some more time. |
09:40
🔗
|
|
bzc6p has left |
10:00
🔗
|
|
Jeroen52 has quit IRC (Ping timeout: 260 seconds) |
10:15
🔗
|
|
Jeroen52 has joined #archiveteam |
10:37
🔗
|
|
dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.) |
10:44
🔗
|
|
dashcloud has joined #archiveteam |
11:37
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
11:41
🔗
|
|
dashcloud has joined #archiveteam |
11:44
🔗
|
arkiver |
I'm currently working on a warrior project for dnshistory |
11:44
🔗
|
arkiver |
We'll use the pages as items |
11:45
🔗
|
arkiver |
So I don't think we need a discovery to get a list of domains |
11:45
🔗
|
arkiver |
We just need to know the number of pages |
12:08
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
12:19
🔗
|
joepie91 |
arkiver: we don't, and probably can't |
12:24
🔗
|
arkiver |
joepie91: sure we can |
12:25
🔗
|
arkiver |
(the number of pages) |
12:39
🔗
|
|
j08nY has joined #archiveteam |
12:59
🔗
|
|
ndiddy has joined #archiveteam |
13:32
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
13:38
🔗
|
|
BartoCH has joined #archiveteam |
14:18
🔗
|
|
Fake-Name has quit IRC (Ping timeout: 260 seconds) |
14:24
🔗
|
|
hive-mind has quit IRC (Remote host closed the connection) |
14:26
🔗
|
|
hive-mind has joined #archiveteam |
14:29
🔗
|
|
WinterFox has quit IRC (Read error: Operation timed out) |
14:48
🔗
|
|
metalcamp has joined #archiveteam |
14:54
🔗
|
|
Smiley has joined #archiveteam |
14:56
🔗
|
|
SmileyG has quit IRC (Read error: Operation timed out) |
14:56
🔗
|
|
j08nY has quit IRC (Read error: Operation timed out) |
14:57
🔗
|
|
TC01_ has joined #archiveteam |
14:58
🔗
|
|
d_rebel has quit IRC (Read error: Operation timed out) |
14:58
🔗
|
|
cadbury_ has quit IRC (Read error: Connection reset by peer) |
14:59
🔗
|
|
d_rebel has joined #archiveteam |
15:00
🔗
|
|
brayden has quit IRC (Read error: Operation timed out) |
15:02
🔗
|
|
Baljem_ has joined #archiveteam |
15:02
🔗
|
|
Baljem has quit IRC (Read error: Connection reset by peer) |
15:03
🔗
|
|
maseck has quit IRC (Ping timeout: 633 seconds) |
15:03
🔗
|
|
TC01 has quit IRC (Ping timeout: 633 seconds) |
15:03
🔗
|
|
blblblbl has quit IRC (Read error: Connection reset by peer) |
15:04
🔗
|
|
maseck has joined #archiveteam |
15:08
🔗
|
|
xmc has quit IRC (Ping timeout: 633 seconds) |
15:08
🔗
|
|
jch has joined #archiveteam |
15:09
🔗
|
|
dashcloud has quit IRC (Ping timeout: 633 seconds) |
15:13
🔗
|
|
dxrt- has quit IRC (Ping timeout: 633 seconds) |
15:15
🔗
|
|
cadbury_ has joined #archiveteam |
15:16
🔗
|
|
DoomTay has joined #archiveteam |
15:16
🔗
|
|
jch has quit IRC (Read error: Connection reset by peer) |
15:20
🔗
|
|
aschmitz has quit IRC (Excess Flood) |
15:21
🔗
|
|
jch has joined #archiveteam |
15:21
🔗
|
|
aschmitz has joined #archiveteam |
15:26
🔗
|
|
xmc has joined #archiveteam |
15:26
🔗
|
|
swebb sets mode: +o xmc |
15:36
🔗
|
|
dashcloud has joined #archiveteam |
15:42
🔗
|
|
VADemon has joined #archiveteam |
15:42
🔗
|
|
brayden has joined #archiveteam |
15:42
🔗
|
|
swebb sets mode: +o brayden |
15:42
🔗
|
|
JesseW has joined #archiveteam |
16:08
🔗
|
|
xmc has quit IRC (Read error: Operation timed out) |
16:12
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
16:13
🔗
|
|
brayden has quit IRC (Read error: Operation timed out) |
16:15
🔗
|
|
cadbury_ has quit IRC (Read error: Operation timed out) |
16:22
🔗
|
|
jmad980 has quit IRC (Ping timeout: 244 seconds) |
16:24
🔗
|
|
jch has quit IRC (Read error: Connection reset by peer) |
16:26
🔗
|
|
cadbury_ has joined #archiveteam |
16:26
🔗
|
|
xmc has joined #archiveteam |
16:26
🔗
|
|
swebb sets mode: +o xmc |
16:28
🔗
|
|
jch has joined #archiveteam |
16:59
🔗
|
|
jmad980 has joined #archiveteam |
17:02
🔗
|
|
Fake-Name has joined #archiveteam |
17:30
🔗
|
|
JesseW has joined #archiveteam |
17:42
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
17:44
🔗
|
|
jmad980 has quit IRC (Read error: Operation timed out) |
17:46
🔗
|
|
cadbury_ has quit IRC (Read error: Operation timed out) |
17:51
🔗
|
|
cadbury_ has joined #archiveteam |
17:59
🔗
|
|
jmad980 has joined #archiveteam |
18:12
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
18:16
🔗
|
|
dashcloud has joined #archiveteam |
18:18
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
18:22
🔗
|
|
dashcloud has joined #archiveteam |
18:40
🔗
|
|
yipdw has quit IRC (Read error: Operation timed out) |
18:46
🔗
|
|
yipdw has joined #archiveteam |
18:48
🔗
|
|
arkiver2 has joined #archiveteam |
18:48
🔗
|
|
swebb sets mode: +o arkiver2 |
18:52
🔗
|
|
arkiver2 has quit IRC (Remote host closed the connection) |
19:03
🔗
|
|
Tomcat_ has joined #archiveteam |
19:08
🔗
|
|
Tomcat_ has quit IRC (Remote host closed the connection) |
19:38
🔗
|
|
DoomTay has joined #archiveteam |
20:00
🔗
|
|
tomwsmf-a has joined #archiveteam |
20:02
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
20:04
🔗
|
|
Froggypwn has quit IRC (Read error: Connection reset by peer) |
20:20
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
20:25
🔗
|
|
Froggypwn has joined #archiveteam |
20:25
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
21:25
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
21:28
🔗
|
|
kristian_ has joined #archiveteam |
21:28
🔗
|
|
dashcloud has joined #archiveteam |
22:01
🔗
|
|
kristian_ has quit IRC (Leaving) |
22:25
🔗
|
|
wyatt8740 has joined #archiveteam |
22:25
🔗
|
|
philpem has joined #archiveteam |
22:25
🔗
|
wyatt8740 |
well, my C program for parsing warc's is quite probably the most horrifically bad C I've ever written |
22:25
🔗
|
wyatt8740 |
but so far it's parsing it :D |
22:27
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:27
🔗
|
yipdw |
if you want to make sure you didn't miss any buffer overflows there is a vast corpus available |
22:27
🔗
|
wyatt8740 |
it looked beautiful, but then I hit a bug trying to use fseeko() and ended up ruining my legibility |
22:27
🔗
|
arkiver |
What does it do exactly? |
22:27
🔗
|
wyatt8740 |
I've been very careful with my mallocs and pointers now :P |
22:27
🔗
|
wyatt8740 |
extracts files from a warc |
22:27
🔗
|
wyatt8740 |
in C |
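For comparison, a minimal Python sketch of the record-splitting step such an extractor needs. This is not wyatt8740's C program. It assumes an uncompressed WARC whose records have the usual layout: a WARC/ version line, named header fields, a blank line, then exactly Content-Length bytes of content. Header names are matched case-sensitively here for brevity, and the function name is made up.

```python
import io


def read_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines that follow each record
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length says how many bytes of content follow the headers.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


if __name__ == "__main__":
    sample = (
        b"WARC/1.0\r\n"
        b"WARC-Type: resource\r\n"
        b"WARC-Target-URI: http://example.com/a.txt\r\n"
        b"Content-Length: 5\r\n"
        b"\r\n"
        b"hello\r\n\r\n"
    )
    for headers, payload in read_records(io.BytesIO(sample)):
        print(headers["WARC-Target-URI"], len(payload))
```

For response records the payload still begins with the HTTP status line and headers, so extracting the file itself means skipping past those as well. WARCs found in the wild are usually gzip-compressed, which this sketch does not handle.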
22:27
🔗
|
yipdw |
there was some kerfuffle about "why there are no C libraries for WARC" and the short version is that string operations in C are horrible |
22:28
🔗
|
wyatt8740 |
there's not even a C++ one |
22:28
🔗
|
wyatt8740 |
that's the shocker |
22:28
🔗
|
|
ravetcofx has quit IRC (Remote host closed the connection) |
22:28
🔗
|
yipdw |
string operations in C++ are also horrible |
22:28
🔗
|
wyatt8740 |
and java ones aren't? -_- |
22:28
🔗
|
Frogging |
at least in C++ you get inbuilt dynamic strings :p |
22:28
🔗
|
wyatt8740 |
^ |
22:28
🔗
|
yipdw |
I don't know what Java has to do with this |
22:28
🔗
|
wyatt8740 |
a java WARC library exists |
22:28
🔗
|
yipdw |
ok |
22:28
🔗
|
yipdw |
so someone decided to write one, that's good |
22:29
🔗
|
wyatt8740 |
...using maven |
22:29
🔗
|
Frogging |
far more Python ones though |
22:29
🔗
|
wyatt8740 |
so not something nice and simple |
22:29
🔗
|
wyatt8740 |
I was trying to parse a WARC from my android phone, so python wasn't really a good option |
22:29
🔗
|
yipdw |
k |
22:31
🔗
|
wyatt8740 |
anyway, my code turned to spaghetti while I was ironing out a bug |
22:31
🔗
|
wyatt8740 |
but it's working |
22:31
🔗
|
|
dashcloud has joined #archiveteam |
22:33
🔗
|
|
ravetcofx has joined #archiveteam |
22:42
🔗
|
arkiver |
chfoo: can you please create a target for 'thomas' on FOS? |
22:42
🔗
|
arkiver |
SketchCow: we're going to do a little project on http://thomas.loc.gov/home/thomas.php |
22:42
🔗
|
arkiver |
It's going away on the 5th |
22:52
🔗
|
Frogging |
awesome |
22:52
🔗
|
Frogging |
the project bit, not the going away bit |
22:52
🔗
|
arkiver |
:) |
22:57
🔗
|
zino |
If you need one quick I happen to be awake and have a shell ready on eldrimner. :) |
22:57
🔗
|
zino |
arkiver ^ |
22:57
🔗
|
arkiver |
awesome |
22:58
🔗
|
arkiver |
oh, I'll PM you on what to do with the data from coursera |
22:58
🔗
|
zino |
I'll have a target in 2min |
22:58
🔗
|
zino |
OK |
22:58
🔗
|
arkiver |
I won't have the scripts ready yet though in 2 minutes |
22:59
🔗
|
zino |
I'll be awake for another 30min or so. |
22:59
🔗
|
arkiver |
ok |
22:59
🔗
|
Frogging |
code faster or the robots will eat you |
23:00
🔗
|
arkiver |
nooooooo |
23:00
🔗
|
arkiver |
lol |
23:00
🔗
|
zino |
thomas target up on eldrimner.lysator.liu.se |
23:00
🔗
|
arkiver |
awesome |
23:00
🔗
|
arkiver |
thanks! |
23:01
🔗
|
bwn |
you need old glory robot insurance |
23:08
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
23:10
🔗
|
|
antomati_ has quit IRC (Read error: Connection reset by peer) |
23:11
🔗
|
|
antomatic has joined #archiveteam |
23:11
🔗
|
|
swebb sets mode: +o antomatic |
23:12
🔗
|
arkiver |
would anyone have a target with a little bit of space for some discovery files for thomas? |
23:12
🔗
|
|
Emcy_ has joined #archiveteam |
23:13
🔗
|
luckcolor |
i have |
23:13
🔗
|
luckcolor |
it's ssd |
23:14
🔗
|
luckcolor |
how much space do you need |
23:14
🔗
|
luckcolor |
if it's in the range of 30gb i can manage |
23:14
🔗
|
luckcolor |
arkiver: |
23:15
🔗
|
arkiver |
it's in the range of a few MB probably |
23:15
🔗
|
arkiver |
please PM me the target |
23:15
🔗
|
arkiver |
rsync |
23:15
🔗
|
luckcolor |
ok hold on a sec |
23:16
🔗
|
arkiver |
thaks |
23:16
🔗
|
arkiver |
thanks* |
23:22
🔗
|
|
Emcy has quit IRC (Read error: Operation timed out) |
23:28
🔗
|
|
Emcy has joined #archiveteam |
23:29
🔗
|
|
antomatic has quit IRC (Read error: Connection reset by peer) |
23:29
🔗
|
|
lytv has joined #archiveteam |
23:29
🔗
|
|
Smiley has quit IRC (Remote host closed the connection) |
23:29
🔗
|
|
antomatic has joined #archiveteam |
23:29
🔗
|
|
swebb sets mode: +o antomatic |
23:32
🔗
|
|
vtyl has quit IRC (Read error: Operation timed out) |
23:34
🔗
|
|
Emcy_ has quit IRC (Read error: Operation timed out) |
23:43
🔗
|
|
Smiley has joined #archiveteam |
23:49
🔗
|
chfoo |
arkiver, done |
23:49
🔗
|
arkiver |
thanks! |
23:49
🔗
|
arkiver |
zino: for now we'll be using FOS, if needed I'll use your target too |