Time |
Nickname |
Message |
00:12
🔗
|
godane |
so looks like i can now brute force the NASA docs |
00:12
🔗
|
godane |
using the real url vs the url that redirects |
00:17
🔗
|
DoomTay |
I remember you said that a load seemed to be gone. Now might be a good time to double-check those? |
00:28
🔗
|
|
venture37 has left |
00:36
🔗
|
|
Ravenloft has joined #archiveteam |
00:43
🔗
|
|
ris has quit IRC () |
00:48
🔗
|
|
ccordova has quit IRC (Remote host closed the connection) |
01:02
🔗
|
|
zhongfu has quit IRC (Quit: cya losers) |
01:03
🔗
|
|
zhongfu has joined #archiveteam |
01:06
🔗
|
|
j08nY has quit IRC (Quit: Leaving) |
01:22
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
01:23
🔗
|
|
Emcy_ has joined #archiveteam |
01:28
🔗
|
|
Start has joined #archiveteam |
01:32
🔗
|
|
davidar_ has joined #archiveteam |
01:35
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
01:42
🔗
|
|
nertzy has quit IRC (Read error: Operation timed out) |
02:02
🔗
|
|
tfgbd_znc has quit IRC (Ping timeout: 633 seconds) |
02:16
🔗
|
|
Fake-Name has joined #archiveteam |
02:17
🔗
|
|
Fake-Nam1 has quit IRC (Read error: Operation timed out) |
02:23
🔗
|
|
dashcloud has quit IRC (Ping timeout: 250 seconds) |
02:26
🔗
|
|
dashcloud has joined #archiveteam |
02:42
🔗
|
|
nertzy has joined #archiveteam |
02:51
🔗
|
|
JesseW has joined #archiveteam |
03:08
🔗
|
|
antomati_ has joined #archiveteam |
03:08
🔗
|
|
swebb sets mode: +o antomati_ |
03:10
🔗
|
|
oli_ has joined #archiveteam |
03:16
🔗
|
|
TC01 has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
Igloo has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
godane has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
remsen has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
botpie91 has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
khaoohs_ has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
nwf_ has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
Coderjoe has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
MMovie has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
luckcolor has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
oli has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
ploop has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
Lord_Nigh has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
antomatic has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
sivoais_ has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
SirCmpwn has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
bwn has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
Atom-- has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
mhazinsk has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
phuzion has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
rossdylan has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
aMunster has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
beardicus has quit IRC (hub.efnet.us ircd.choopa.net) |
03:16
🔗
|
|
khaoohs_ has joined #archiveteam |
03:16
🔗
|
|
TC01 has joined #archiveteam |
03:16
🔗
|
|
Igloo has joined #archiveteam |
03:16
🔗
|
|
godane has joined #archiveteam |
03:16
🔗
|
|
ploop has joined #archiveteam |
03:16
🔗
|
|
sivoais_ has joined #archiveteam |
03:16
🔗
|
|
remsen has joined #archiveteam |
03:16
🔗
|
|
SirCmpwn has joined #archiveteam |
03:16
🔗
|
|
bwn has joined #archiveteam |
03:16
🔗
|
|
rossdylan has joined #archiveteam |
03:16
🔗
|
|
beardicus has joined #archiveteam |
03:16
🔗
|
|
ircd.choopa.net sets mode: +o beardicus |
03:16
🔗
|
|
swebb sets mode: +o beardicus |
03:16
🔗
|
|
LordNigh2 has joined #archiveteam |
03:18
🔗
|
|
aMunster has joined #archiveteam |
03:29
🔗
|
|
remsen1 has joined #archiveteam |
03:32
🔗
|
|
oli_ is now known as oli |
03:32
🔗
|
|
LordNigh2 is now known as Lord_Nigh |
03:32
🔗
|
|
luckcolor has joined #archiveteam |
03:33
🔗
|
|
Coderjoe has joined #archiveteam |
03:35
🔗
|
|
TC01 has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
Igloo has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
godane has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
remsen has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
aMunster has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
khaoohs_ has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
ploop has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
sivoais_ has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
SirCmpwn has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
bwn has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
rossdylan has quit IRC (hub.efnet.us ircd.choopa.net) |
03:35
🔗
|
|
beardicus has quit IRC (hub.efnet.us ircd.choopa.net) |
03:37
🔗
|
|
bwn_ has joined #archiveteam |
03:38
🔗
|
|
khaoohs_ has joined #archiveteam |
03:38
🔗
|
|
TC01 has joined #archiveteam |
03:38
🔗
|
|
Igloo has joined #archiveteam |
03:38
🔗
|
|
godane has joined #archiveteam |
03:38
🔗
|
|
ploop has joined #archiveteam |
03:38
🔗
|
|
sivoais_ has joined #archiveteam |
03:38
🔗
|
|
SirCmpwn has joined #archiveteam |
03:38
🔗
|
|
rossdylan has joined #archiveteam |
03:38
🔗
|
|
beardicus has joined #archiveteam |
03:38
🔗
|
|
ircd.choopa.net sets mode: +o beardicus |
03:38
🔗
|
|
swebb sets mode: +o beardicus |
03:38
🔗
|
|
nwf_ has joined #archiveteam |
03:39
🔗
|
|
aMunster has joined #archiveteam |
03:43
🔗
|
|
jmad980 has quit IRC (Ping timeout: 633 seconds) |
03:46
🔗
|
|
nwf_ has quit IRC (Read error: Connection reset by peer) |
03:46
🔗
|
|
nwf_ has joined #archiveteam |
03:50
🔗
|
|
bwn_ is now known as bwn |
03:56
🔗
|
|
jmad980 has joined #archiveteam |
03:56
🔗
|
|
Start has joined #archiveteam |
04:53
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
05:01
🔗
|
|
Sk1d has joined #archiveteam |
05:20
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
05:23
🔗
|
|
ndizzle has quit IRC (Read error: Operation timed out) |
05:28
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
06:17
🔗
|
|
tomwsmf-a has quit IRC (Ping timeout: 258 seconds) |
06:23
🔗
|
|
BartoCH has quit IRC (Read error: Connection reset by peer) |
06:32
🔗
|
|
BartoCH has joined #archiveteam |
06:41
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
07:16
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
07:17
🔗
|
|
Wuked has joined #archiveteam |
07:29
🔗
|
|
Wuked has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) |
07:33
🔗
|
|
atomotic has joined #archiveteam |
08:02
🔗
|
|
aMunster has quit IRC (Read error: Operation timed out) |
08:10
🔗
|
|
aMunster has joined #archiveteam |
08:13
🔗
|
|
schbirid has joined #archiveteam |
08:13
🔗
|
|
phuzion has joined #archiveteam |
08:52
🔗
|
|
pikhq has quit IRC (Ping timeout: 506 seconds) |
09:11
🔗
|
|
pikhq has joined #archiveteam |
09:21
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
09:30
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
09:30
🔗
|
|
Wuked has joined #archiveteam |
09:42
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
09:48
🔗
|
|
kristian_ has joined #archiveteam |
09:49
🔗
|
|
BartoCH has joined #archiveteam |
09:50
🔗
|
|
dashcloud has joined #archiveteam |
09:50
🔗
|
|
pfallenop has quit IRC (Ping timeout: 260 seconds) |
09:58
🔗
|
|
pfallenop has joined #archiveteam |
09:58
🔗
|
|
mhazinsk has joined #archiveteam |
10:19
🔗
|
|
metal_cam has joined #archiveteam |
10:20
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
10:25
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
10:28
🔗
|
|
dashcloud has joined #archiveteam |
10:33
🔗
|
|
WinterFox has joined #archiveteam |
10:33
🔗
|
|
WinterFox has quit IRC (Read error: Connection reset by peer) |
10:33
🔗
|
|
W1nterFox has joined #archiveteam |
10:44
🔗
|
arkiver |
strange |
10:44
🔗
|
arkiver |
coursera works in webarchiveplayer |
10:45
🔗
|
arkiver |
but it looks like it doesn't exist in the Wayback Machine https://wayback-beta.archive.org/web/20160627062439/https://class.coursera.org/virology-001 |
10:45
🔗
|
arkiver |
I'll be writing a little tool anyway to export full courses from the Wayback Machine |
10:51
🔗
|
|
Wuked has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) |
10:53
🔗
|
arkiver |
SketchCow: currently everything that is being uploaded from FOS to IA is not deriving. |
10:57
🔗
|
|
Wuked has joined #archiveteam |
11:35
🔗
|
|
brayden has quit IRC (Read error: Connection reset by peer) |
11:36
🔗
|
|
brayden has joined #archiveteam |
11:36
🔗
|
|
swebb sets mode: +o brayden |
11:42
🔗
|
SketchCow |
Yeah, I'll ask today. |
11:45
🔗
|
Igloo |
Other than coursera are there any other active projects? |
11:52
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
11:55
🔗
|
|
dashcloud has joined #archiveteam |
12:03
🔗
|
|
kristian_ has quit IRC (Leaving) |
12:22
🔗
|
|
ndiddy has joined #archiveteam |
12:37
🔗
|
SketchCow |
The FOS is focused on Coursera uploads and ArchiveBot uploads. Once it pushes through both backlogs, it should be much more effective very quickly. |
12:41
🔗
|
|
rolfb has joined #archiveteam |
12:42
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
12:45
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
12:48
🔗
|
SketchCow |
Finally added date stamps to the homer shover page |
12:49
🔗
|
|
dashcloud has joined #archiveteam |
13:47
🔗
|
BartoCH |
SketchCow: do you think you guys will be able to get everything before it closes? The deadline is damn close. |
13:48
🔗
|
arkiver |
chfoo: can you send me the logs of coursera as soon as possible? |
13:49
🔗
|
Igloo |
We have about 30 to grab, but I think they're all issued at the moment, so I don't know if we can reissue them. arkiver? |
13:49
🔗
|
arkiver |
they're mostly issued at the moment |
13:49
🔗
|
arkiver |
I'll have a look at what we can requeue |
13:49
🔗
|
arkiver |
you only have the aboriginaled item right? |
13:51
🔗
|
Igloo |
Yep |
13:51
🔗
|
arkiver |
ok |
13:59
🔗
|
|
Wuked has quit IRC (Ping timeout: 258 seconds) |
13:59
🔗
|
|
VADemon has joined #archiveteam |
14:00
🔗
|
|
Wuked has joined #archiveteam |
14:10
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
14:11
🔗
|
luckcolor |
arkiver: for the last round pls update the scripts to stop treating 500 errors as fatal |
14:13
🔗
|
arkiver |
which project |
14:13
🔗
|
luckcolor |
coursera |
14:13
🔗
|
arkiver |
that's with the retrfinished right? |
14:14
🔗
|
luckcolor |
yeah the one i told you yesterday |
14:14
🔗
|
arkiver |
that is fixed |
14:14
🔗
|
arkiver |
currently only Igloo, Medowar and HCross have items though |
14:14
🔗
|
arkiver |
and we're almost done |
14:14
🔗
|
|
dashcloud has joined #archiveteam |
14:15
🔗
|
HCross |
main issue we are running into is FOS, and as the saying goes "too many cooks spoil the broth" |
14:15
🔗
|
luckcolor |
arkiver if you need me to run some items let me know |
14:15
🔗
|
luckcolor |
Hcross: wat? XD |
14:16
🔗
|
arkiver |
hold on, I'll add zino's target |
14:17
🔗
|
Igloo |
Uploads are taking forever luckcolor |
14:18
🔗
|
luckcolor |
understandable |
14:19
🔗
|
HCross |
luckcolor, we are talking several hours per upload |
14:19
🔗
|
|
W1nterFox has quit IRC (Remote host closed the connection) |
14:19
🔗
|
HCross |
arkiver, thats more like it |
14:19
🔗
|
arkiver |
zino: we are using your target, when the project is finished, can you sync it to FOS? |
14:19
🔗
|
arkiver |
I'll give you a target on FOS by then |
14:25
🔗
|
|
trs80 has quit IRC (Ping timeout: 190 seconds) |
14:32
🔗
|
|
rolfb has quit IRC (Leaving...) |
15:06
🔗
|
|
Piet0r has left |
15:22
🔗
|
|
nertzy has quit IRC (Read error: Operation timed out) |
15:26
🔗
|
|
Aranje has joined #archiveteam |
15:41
🔗
|
wumpus |
no reply from email to dnshistory.com about getting a copy of their database, so I filled out the support webform on their site... |
15:41
🔗
|
arkiver |
We'll do a project for the site |
15:55
🔗
|
|
trs80 has joined #archiveteam |
16:00
🔗
|
|
RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) |
16:04
🔗
|
zino |
arkiver: Sure. |
16:06
🔗
|
zino |
Who admins FOS? When I start syncing there I might want to talk TCP settings. |
16:09
🔗
|
|
JesseW has joined #archiveteam |
16:14
🔗
|
|
DoomTay has joined #archiveteam |
16:16
🔗
|
|
metal_cam is now known as metalcamp |
16:25
🔗
|
|
RichardG has joined #archiveteam |
16:26
🔗
|
wumpus |
I've reached #271 on the URLTeam leaderboard... go me! |
16:33
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
16:54
🔗
|
xmc |
zino: fos is run by SketchCow |
16:54
🔗
|
SketchCow |
TCP settings, you say |
16:56
🔗
|
xmc |
clearly, zino wants you to help him fill the disk faster than you can empty it |
17:06
🔗
|
|
Ravenloft has quit IRC (Ping timeout: 244 seconds) |
17:20
🔗
|
SketchCow |
Under "Three people cared", the fos.textfiles.com/ARCHIVETEAM page now accurately shows the timestamp of archivebot uploads. So no more gaps in the "Uploaded" column going forward. |
17:21
🔗
|
|
Tomcat_ has joined #archiveteam |
17:22
🔗
|
|
Wuked has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) |
17:33
🔗
|
|
rolfb has joined #archiveteam |
17:34
🔗
|
|
arrith has quit IRC (Read error: Operation timed out) |
17:42
🔗
|
|
rolfb has quit IRC (Ping timeout: 506 seconds) |
18:02
🔗
|
zino |
SketchCow: Based on my attempts to upload to IA's S3 servers I suspect they aren't configured for long-haul TCP. I'll annoy you about that when I start uploading to FOS later. |
18:08
🔗
|
wumpus |
I'm happy to discuss networking with you, zino, I work at IA and am familiar with our setup. |
18:08
🔗
|
wumpus |
(We have a LOT of people uploading from far away.) |
18:08
🔗
|
* |
HCross hides in a corner |
18:08
🔗
|
HCross |
yea. sorry about the constant 300Mbps from france |
18:09
🔗
|
wumpus |
We have 40 gigabits |
18:09
🔗
|
Frogging |
same :p |
18:09
🔗
|
Frogging |
re: the big france pipe |
18:10
🔗
|
wumpus |
Our "network weathermap" is down today because we're adding a new ISP, but in general we only have a few gigabits incoming out of 40 max. |
18:10
🔗
|
|
vitzli has joined #archiveteam |
18:10
🔗
|
HCross |
wumpus, who have you got coming in now? |
18:11
🔗
|
wumpus |
Our friends at ISC, mostly. |
18:11
🔗
|
HCross |
might I suggest #archiveteam-bs |
18:14
🔗
|
SketchCow |
Or #internetarchive |
18:15
🔗
|
zino |
wumpus: So what I'm talking about is just regular TCP window scaling. I have a hard time getting above a few megabytes per second per connection when uploading to IA from Sweden. To sustain anything reasonable I have to use 30-50 parallel uploads. In contrast, I can push quite a bit more to Amazon S3 US West and US East. |
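A rough sketch of the arithmetic behind zino's point: a single TCP connection can send at most one window of data per round trip, so its throughput is capped at window / RTT. The window size, round-trip time, and target rate below are illustrative assumptions, not measurements from this log.

```python
import math


def max_throughput_bytes_per_s(window_bytes: int, rtt_s: float) -> float:
    """Upper bound for one TCP connection: one full window per round trip."""
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    return window_bytes / rtt_s


def connections_needed(target_bytes_per_s: float, window_bytes: int, rtt_s: float) -> int:
    """Number of parallel connections needed to reach a target aggregate rate."""
    per_connection = max_throughput_bytes_per_s(window_bytes, rtt_s)
    return max(1, math.ceil(target_bytes_per_s / per_connection))


if __name__ == "__main__":
    # Assumed values: a 512 KiB effective window and a 200 ms transatlantic RTT.
    window = 512 * 1024
    rtt = 0.2
    print(max_throughput_bytes_per_s(window, rtt))
    # Assumed target: 100 MB/s aggregate.
    print(connections_needed(100e6, window, rtt))
```

Under these assumed numbers one connection tops out at about 2.5 MiB/s and roughly 39 connections are needed for 100 MB/s, which is in the same range as the 30-50 parallel uploads mentioned above. A larger window on both ends raises the per-connection cap.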
18:16
🔗
|
SketchCow |
Spoiler is I'm not going to modify FOS settings |
18:16
🔗
|
zino |
Aw. :-( |
18:17
🔗
|
SketchCow |
But feel free to work through what possible bottlenecks are in place, see what possible solutions there are. |
18:18
🔗
|
zino |
Well. I haven't tried uploading to FOS yet. So maybe it will magically work without problems... |
18:19
🔗
|
|
ndiddy has quit IRC (Read error: Connection reset by peer) |
18:21
🔗
|
SketchCow |
You literally wanted me to make network changes without interfacing with the network first? |
18:22
🔗
|
SketchCow |
Bold move, soldier |
18:22
🔗
|
wumpus |
highly parallel uploads are the way to go, given all of the places that packets can be lost that neither of us control. |
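A minimal sketch of the "highly parallel uploads" approach using a thread pool. The `upload_one` callable is a hypothetical stand-in for whatever actually sends a file (it is not an IA or FOS API), and the worker count is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def upload_all(
    paths: Iterable[str],
    upload_one: Callable[[str], None],
    workers: int = 32,
) -> Dict[str, str]:
    """Run upload_one(path) for every path in parallel.

    Returns a mapping of path -> error description for uploads that raised,
    so the caller can retry only the failures.
    """
    failures: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(upload_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
            except Exception as exc:  # one failed upload must not stop the rest
                failures[path] = repr(exc)
    return failures
```

Each worker holds its own connection, so a packet loss that stalls one stream only slows that stream rather than the whole transfer.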
18:23
🔗
|
zino |
SketchCow: Nope, I wanted to talk with you later about maybe modifying the TCP buffers and window scaling after starting uploads. |
18:24
🔗
|
SketchCow |
I mean, that won't happen. |
18:24
🔗
|
zino |
Noted. |
18:24
🔗
|
SketchCow |
But really, next time do a thing and find the thing not working before coming up with potential solutions or announcing your intention to demand a change. |
18:25
🔗
|
zino |
No demand. I wanted a contact for when it inevitably fucked up. |
18:46
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
18:50
🔗
|
|
dashcloud has joined #archiveteam |
18:50
🔗
|
|
Tomcat_ has quit IRC (Remote host closed the connection) |
18:54
🔗
|
|
vitzli has quit IRC (Quit: Leaving) |
19:08
🔗
|
|
MMovie has joined #archiveteam |
19:10
🔗
|
|
tomwsmf-a has joined #archiveteam |
19:15
🔗
|
|
tfgbd_znc has joined #archiveteam |
19:15
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
19:18
🔗
|
|
MMovie has quit IRC (Leaving.) |
19:19
🔗
|
|
dashcloud has joined #archiveteam |
19:21
🔗
|
arkiver |
scripts for arto are updated for the final run |
19:21
🔗
|
arkiver |
now skipping any bad URLs. |
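One way "skipping bad URLs" instead of failing on them could look, sketched in Python rather than the project's actual scripts. The `fetch` callable, retry count, and backoff are all hypothetical.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    url: str,
    fetch: Callable[[str], int],
    max_tries: int = 3,
    backoff_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call fetch(url), which returns an HTTP status code.

    Server errors (5xx) are retried with a growing pause. If every attempt
    fails, return None so the caller can skip the URL instead of aborting
    the whole item.
    """
    for attempt in range(1, max_tries + 1):
        status = fetch(url)
        if status < 500:
            return status
        if attempt < max_tries:
            sleep(backoff_s * attempt)
    return None
```

Returning `None` rather than raising keeps a single persistently broken URL from marking the entire item as failed.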
19:29
🔗
|
|
superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye) |
19:37
🔗
|
|
superkuh has joined #archiveteam |
19:48
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
19:54
🔗
|
|
dashcloud has joined #archiveteam |
19:54
🔗
|
swebb |
My gawker crawl using the latest heritrix is 1.3M urls in (250GB of data downloaded) with 1.6M urls queued |
19:54
🔗
|
|
Wuked has joined #archiveteam |
19:54
🔗
|
|
Wuked has quit IRC (Client Quit) |
19:57
🔗
|
luckcolor |
swebb: how does heritrix compare to other crawlers? |
19:57
🔗
|
luckcolor |
i'm curious cause i haven't used it |
19:58
🔗
|
swebb |
Heritrix is the Internet Archive crawler. It's grown easier to use over the years, but still takes a little to get it working properly. For our grabs, we don't want to swamp the site, but we don't use the defaults either, since those would take a few weeks to crawl the site (depending on the size of the site). It generates WARC files, which the IA and Wayback Machine use. |
20:00
🔗
|
swebb |
Once it's up and running, it has a (newly improved) web interface where you can monitor your jobs and start new ones. |
20:00
🔗
|
luckcolor |
yeah i saw it once |
20:00
🔗
|
|
ris has joined #archiveteam |
20:00
🔗
|
luckcolor |
i mean what do you mean about the defaults |
20:00
🔗
|
luckcolor |
what do you usually change |
20:01
🔗
|
swebb |
Oh, the crawl waits are pretty slow - like 5-30 seconds between urls per hostname, but I change things around a bit. The config that I'm using for the gawker crawl is: https://gist.github.com/scumola/6c2dc8c96d2165e9fb608d49c15e0ebf |
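A generic sketch of the kind of per-hostname wait swebb describes: the pause before the next request to a host scales with how long the last fetch took, clamped between a minimum and a maximum. This is not Heritrix code or its configuration; the function, class, and default values are assumptions for illustration.

```python
from typing import Dict


def politeness_delay_s(
    last_fetch_s: float,
    factor: float = 5.0,
    min_s: float = 3.0,
    max_s: float = 30.0,
) -> float:
    """Wait `factor` times the last fetch duration, clamped to [min_s, max_s]."""
    return min(max_s, max(min_s, factor * last_fetch_s))


class HostThrottle:
    """Tracks, per hostname, the earliest time the next request may be sent."""

    def __init__(self) -> None:
        self._next_ok: Dict[str, float] = {}

    def wait_time(self, host: str, now: float) -> float:
        """Seconds to wait before contacting `host` again (0.0 if allowed now)."""
        return max(0.0, self._next_ok.get(host, 0.0) - now)

    def record_fetch(self, host: str, finished_at: float, fetch_duration_s: float) -> None:
        """Note a completed fetch and schedule the next allowed request."""
        self._next_ok[host] = finished_at + politeness_delay_s(fetch_duration_s)
```

Lowering the factor or the minimum speeds up a crawl at the cost of more load on the site, which is the trade-off discussed above.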
20:06
🔗
|
swebb |
I also have heritrix crawls for doc2doc and portalgraphics.net running at the same time. |
20:06
🔗
|
arkiver |
nice |
20:06
🔗
|
DoomTay |
Oh? |
20:07
🔗
|
DoomTay |
I had no idea anyone was doing that second one |
20:07
🔗
|
arkiver |
not so sure about portalgraphics though |
20:07
🔗
|
DoomTay |
Uh? |
20:07
🔗
|
arkiver |
did you change it to grab the URLs that a 'normal' crawler wouldn't get? |
20:07
🔗
|
swebb |
https://www.evernote.com/l/ACn1_ZZ6tDtGeID-OC6Jh8eZWsAh0kzMV5U |
20:07
🔗
|
swebb |
Like ignore robots.txt? No. |
20:08
🔗
|
swebb |
It still honors robots.txt |
20:08
🔗
|
DoomTay |
I think he means find a specific XML which points to other parts of a Flash movie |
20:08
🔗
|
DoomTay |
That is basically a making-of of a given image |
20:08
🔗
|
swebb |
It'll parse urls from all kinds of different types of files. It'll even parse javascript to render urls, I think. |
20:09
🔗
|
swebb |
I doubt if it's grabbing multi-part flash video. |
20:10
🔗
|
swebb |
The kind of crawls that I do with heritrix are frequently the kind where IA already has several copies of the site over time, but people here want a 'full' crawl. I guess that IA does incremental crawls or something? |
20:11
🔗
|
DoomTay |
It's structured something like http://www.portalgraphics.net/pg/movie/pg_player/res_movie_data.php?mid=80728&lang=en though that URL is "hidden" in comments so I don't know if that can be picked up |
20:11
🔗
|
DoomTay |
Another thing to look out for is that if the site is overloaded, a page will be rendered as a message basically saying the lines are full |
20:12
🔗
|
swebb |
Yea, it is not smart enough to catch those. |
20:12
🔗
|
DoomTay |
Should be pretty easy to notice afterwards with its small content-length |
20:13
🔗
|
DoomTay |
Then again, a "doesn't exist anymore" substitute will probably be even smaller |
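A small sketch of the after-the-fact check DoomTay suggests: flag successful responses whose body is suspiciously small so they can be re-fetched or inspected. The `Fetch` record and the byte threshold are hypothetical; a real check would read these fields from the crawl log or the WARC records.

```python
from typing import Iterable, List, NamedTuple


class Fetch(NamedTuple):
    url: str
    status: int
    content_length: int


def suspicious_fetches(fetches: Iterable[Fetch], min_bytes: int = 2048) -> List[Fetch]:
    """Return 200 responses smaller than min_bytes.

    A small 200 response may be an "overloaded" placeholder page, but, as
    noted above, it may also be a "doesn't exist anymore" page, so size alone
    only narrows the list for manual review.
    """
    return [f for f in fetches if f.status == 200 and f.content_length < min_bytes]
```

Because both placeholder types are small, the flagged URLs still need a second look (for example, by matching known message text) before deciding which to re-crawl.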
20:13
🔗
|
swebb |
Can you tell when I started the gawker crawl? :) https://www.evernote.com/l/AClf9BO53KVARaJs_jTGzQ59jXrrnpvJhao |
20:13
🔗
|
|
bauruine has quit IRC (Ping timeout: 260 seconds) |
20:23
🔗
|
|
bauruine has joined #archiveteam |
20:42
🔗
|
|
MMovie has joined #archiveteam |
20:47
🔗
|
|
ohhdemgir has joined #archiveteam |
20:59
🔗
|
|
Wuked has joined #archiveteam |
21:01
🔗
|
|
Wuked has quit IRC (Client Quit) |
21:02
🔗
|
|
Wuked has joined #archiveteam |
21:03
🔗
|
|
Wuked has quit IRC (Client Quit) |
21:14
🔗
|
|
maseck has quit IRC (Remote host closed the connection) |
21:14
🔗
|
|
Wuked has joined #archiveteam |
21:22
🔗
|
|
maseck has joined #archiveteam |
21:29
🔗
|
wumpus |
WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD |
21:30
🔗
|
mhazinsk |
yahoosucks |
21:30
🔗
|
wumpus |
(that felt a little silly) |
21:30
🔗
|
xmc |
you think it makes YOU silly |
21:48
🔗
|
|
ohhdemgir has quit IRC (Read error: Operation timed out) |
21:50
🔗
|
|
metalcamp has quit IRC (Ping timeout: 250 seconds) |
22:07
🔗
|
|
redlob has quit IRC (Ping timeout: 260 seconds) |
22:11
🔗
|
|
ohhdemgir has joined #archiveteam |
22:13
🔗
|
|
redlob has joined #archiveteam |
22:33
🔗
|
|
Wuked has quit IRC (My Mac has gone to sleep. ZZZzzz…) |
22:36
🔗
|
joepie91 |
lol |
22:46
🔗
|
|
RichardG has quit IRC (Ping timeout: 260 seconds) |
22:47
🔗
|
|
j08nY has joined #archiveteam |
23:11
🔗
|
|
oituniet has joined #archiveteam |
23:13
🔗
|
|
oituniet has quit IRC (Client Quit) |
23:30
🔗
|
|
RichardG has joined #archiveteam |
23:40
🔗
|
|
antonizoo has quit IRC (Ping timeout: 260 seconds) |
23:41
🔗
|
|
arkiver has quit IRC (Ping timeout: 260 seconds) |
23:42
🔗
|
|
lesderid has quit IRC (Ping timeout: 260 seconds) |
23:42
🔗
|
|
Sanqui has quit IRC (Ping timeout: 260 seconds) |
23:42
🔗
|
|
lesderid has joined #archiveteam |
23:52
🔗
|
|
remsen1 has quit IRC (ZNC 1.6.2 - http://znc.in) |
23:52
🔗
|
|
Sanqui has joined #archiveteam |
23:52
🔗
|
|
remsen has joined #archiveteam |
23:53
🔗
|
|
arkiver has joined #archiveteam |
23:53
🔗
|
|
swebb sets mode: +o arkiver |