Time |
Nickname |
Message |
00:18
🔗
|
|
DoomTay has joined #archiveteam |
00:19
🔗
|
|
Stiletto has joined #archiveteam |
00:19
🔗
|
|
tomwsmf-a has joined #archiveteam |
00:24
🔗
|
|
DiscantX has joined #archiveteam |
00:30
🔗
|
|
JesseW has joined #archiveteam |
00:57
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
00:57
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
01:12
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
01:18
🔗
|
|
ats has quit IRC (Ping timeout: 244 seconds) |
01:23
🔗
|
|
Stiletto has quit IRC (Ping timeout: 244 seconds) |
01:24
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
01:28
🔗
|
|
Coderjoe has joined #archiveteam |
01:34
🔗
|
|
Froggypwn has quit IRC (~ Trillian Astra - www.trillian.im ~) |
01:36
🔗
|
|
philpem has quit IRC (Ping timeout: 260 seconds) |
02:01
🔗
|
|
coretx has quit IRC (Read error: Operation timed out) |
02:02
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
02:02
🔗
|
|
RichardG has joined #archiveteam |
02:04
🔗
|
|
coretx has joined #archiveteam |
02:05
🔗
|
|
JesseW has joined #archiveteam |
02:10
🔗
|
|
tomwsmf-a has quit IRC (Read error: Operation timed out) |
02:18
🔗
|
|
Stiletto has joined #archiveteam |
02:36
🔗
|
|
ats has joined #archiveteam |
02:45
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
02:45
🔗
|
|
RichardG has joined #archiveteam |
02:46
🔗
|
|
ats has quit IRC (Read error: Operation timed out) |
02:52
🔗
|
|
ats has joined #archiveteam |
03:09
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
03:09
🔗
|
|
RichardG has joined #archiveteam |
03:31
🔗
|
|
Froggypwn has joined #archiveteam |
03:33
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
03:48
🔗
|
|
Kitaru_ has joined #archiveteam |
03:52
🔗
|
|
RichardG has quit IRC (Ping timeout: 370 seconds) |
03:54
🔗
|
|
Swizzle has quit IRC (Quit: Leaving) |
03:57
🔗
|
|
RichardG has joined #archiveteam |
04:01
🔗
|
|
Coderjoe has joined #archiveteam |
04:05
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
04:08
🔗
|
|
RichardG has quit IRC (Ping timeout: 260 seconds) |
04:11
🔗
|
|
Sk1d has joined #archiveteam |
04:12
🔗
|
|
RichardG has joined #archiveteam |
04:13
🔗
|
|
Kitaru_ has quit IRC (Quit: This computer has gone to sleep) |
04:25
🔗
|
|
db48x` has quit IRC (Read error: Connection reset by peer) |
04:30
🔗
|
|
GLaDOS has quit IRC (Ping timeout: 260 seconds) |
04:31
🔗
|
|
db48x has joined #archiveteam |
05:05
🔗
|
|
metalcamp has joined #archiveteam |
05:12
🔗
|
|
Trevor has joined #archiveteam |
05:13
🔗
|
|
Trevor has quit IRC (Client Quit) |
05:31
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
06:13
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
06:16
🔗
|
|
dashcloud has joined #archiveteam |
06:54
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
07:01
🔗
|
|
RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) |
07:07
🔗
|
JesseW |
Delightfully weird history: https://motherboard.vice.com/read/the-secret-nuclear-history-of-cat-videos |
07:13
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
07:27
🔗
|
|
RichardG has joined #archiveteam |
07:55
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
08:16
🔗
|
|
WinterFox has joined #archiveteam |
08:25
🔗
|
|
Gfy has quit IRC (Read error: Operation timed out) |
08:32
🔗
|
|
Gfy has joined #archiveteam |
08:50
🔗
|
|
Wuked has joined #archiveteam |
08:50
🔗
|
|
DiscantX has joined #archiveteam |
08:57
🔗
|
|
zhongfu_ has joined #archiveteam |
08:57
🔗
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
09:04
🔗
|
|
zhongfu_ has quit IRC (Ping timeout: 260 seconds) |
09:04
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
09:05
🔗
|
|
zhongfu has joined #archiveteam |
09:12
🔗
|
|
DiscantX has joined #archiveteam |
09:26
🔗
|
|
BlueMaxim has joined #archiveteam |
09:39
🔗
|
|
atomotic has joined #archiveteam |
09:51
🔗
|
|
zhongfu has quit IRC (Remote host closed the connection) |
09:58
🔗
|
|
Wuked has quit IRC (My Mac has gone to sleep. ZZZzzz…) |
10:00
🔗
|
|
Wuked has joined #archiveteam |
10:07
🔗
|
|
Wuked has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) |
10:14
🔗
|
|
zhongfu has joined #archiveteam |
10:17
🔗
|
|
Wuked has joined #archiveteam |
10:20
🔗
|
|
Wuked has quit IRC (Client Quit) |
10:22
🔗
|
|
Wuked has joined #archiveteam |
10:24
🔗
|
|
Wuked has quit IRC (Client Quit) |
10:30
🔗
|
|
Wuked has joined #archiveteam |
10:32
🔗
|
|
zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.) |
10:32
🔗
|
|
Wuked has quit IRC (Client Quit) |
10:32
🔗
|
|
GLaDOS has joined #archiveteam |
10:34
🔗
|
|
zhongfu has joined #archiveteam |
10:45
🔗
|
|
Wuked has joined #archiveteam |
10:47
🔗
|
|
Wuked has quit IRC (Client Quit) |
10:57
🔗
|
|
morbus_ has quit IRC (Read error: Operation timed out) |
11:09
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
11:16
🔗
|
|
Wuked has joined #archiveteam |
12:05
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
12:10
🔗
|
|
BlueMaxim has joined #archiveteam |
12:23
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
13:38
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
13:48
🔗
|
|
VADemon has joined #archiveteam |
13:58
🔗
|
|
Wuked has quit IRC (My Mac has gone to sleep. ZZZzzz…) |
14:10
🔗
|
|
Wuked has joined #archiveteam |
14:17
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
14:59
🔗
|
|
WinterFox has quit IRC (Read error: Operation timed out) |
15:05
🔗
|
|
philpem has joined #archiveteam |
15:06
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
15:07
🔗
|
|
BartoCH has joined #archiveteam |
15:12
🔗
|
|
Wuked has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
15:21
🔗
|
|
r3c0d3x has quit IRC (Ping timeout: 260 seconds) |
15:23
🔗
|
|
r3c0d3x has joined #archiveteam |
15:25
🔗
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
15:25
🔗
|
|
BartoCH has joined #archiveteam |
15:26
🔗
|
|
BartoCH has quit IRC (Client Quit) |
15:27
🔗
|
|
BartoCH has joined #archiveteam |
15:54
🔗
|
|
Start has joined #archiveteam |
15:59
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
16:03
🔗
|
|
ploop has quit IRC (Ping timeout: 244 seconds) |
16:04
🔗
|
|
ploop has joined #archiveteam |
16:15
🔗
|
|
DoomTay has joined #archiveteam |
16:18
🔗
|
|
JesseW has joined #archiveteam |
16:33
🔗
|
arkiver |
JesseW: JW_work: I'll have a look at examiner.com |
16:34
🔗
|
Frogging |
not sure how much we could get in 2 days |
16:34
🔗
|
DoomTay |
Isn't ArchiveBot already doing it, even if it's probably not going off of the sitemaps? |
16:34
🔗
|
Frogging |
DoomTay: it's slow |
16:35
🔗
|
Frogging |
ArchiveBot is one host crawling a site. Warrior projects are many hosts downloading a site in a more organized item-based fashion |
16:36
🔗
|
arkiver |
We are now saving around 1 TB of news with NewsBuddy every day! |
16:37
🔗
|
Frogging |
however in the latter case, the speed is still limited because without rate limiting that's just about the same thing as a DDoS. |
16:37
🔗
|
Frogging |
still faster and more efficient generally though. but necessitates writing scripts that are aware of the site structure |
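A minimal sketch of the per-worker rate limiting Frogging mentions, assuming each worker simply enforces a minimum gap between its own requests. The class name `MinIntervalLimiter` and its parameters are hypothetical and are not taken from any Warrior project code.

```python
import time


class MinIntervalLimiter:
    """Hypothetical helper: allow at most `requests_per_second` requests."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / requests_per_second
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next request is allowed; return the delay applied."""
        now = self.clock()
        delay = 0.0
        if self.next_allowed is not None and now < self.next_allowed:
            delay = self.next_allowed - now
            self.sleep(delay)
            now = self.next_allowed
        # Schedule the earliest time the following request may start.
        self.next_allowed = now + self.interval
        return delay
```

Each worker would call `wait()` before every request. With many workers the total load on the site is roughly the per-worker rate times the number of workers, so the per-worker rate has to be chosen with that in mind.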
16:40
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
16:42
🔗
|
godane |
!a http://www.postnewspapers.com.au/ |
16:42
🔗
|
godane |
i put it in archivebot |
16:48
🔗
|
|
dashcloud has quit IRC (Ping timeout: 244 seconds) |
16:49
🔗
|
|
dashcloud has joined #archiveteam |
16:56
🔗
|
JW_work |
Frogging: #archivebot is *much* slower than nearly any source site, even with a delay of zero, due to the administrative back and forth with the central server. |
17:01
🔗
|
Frogging |
yes, there's that too |
17:48
🔗
|
arkiver |
VideoBot is recording livestreams again. |
17:48
🔗
|
arkiver |
VideoBot can be found at #videobot |
17:49
🔗
|
arkiver |
It is especially handy to record and save Twitter videos directly to the Internet Archive as video items |
17:49
🔗
|
arkiver |
vine and periscope are also supported. periscope needs a fix though |
17:50
🔗
|
arkiver |
Coming up soon for VideoBot is following a twitter hashtag and downloading all twitter and/or periscope videos using that hashtag and uploading those videos to IA as video items. |
17:50
🔗
|
arkiver |
All videos are also saved into the Wayback Machine |
18:45
🔗
|
|
Start has joined #archiveteam |
19:08
🔗
|
DFJustin |
niiice |
19:36
🔗
|
|
dashcloud has quit IRC (Ping timeout: 244 seconds) |
19:37
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
19:39
🔗
|
|
DiscantX has joined #archiveteam |
19:40
🔗
|
|
dashcloud has joined #archiveteam |
19:46
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
19:47
🔗
|
|
Start has joined #archiveteam |
19:52
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
20:07
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
20:08
🔗
|
|
mutoso has quit IRC (Quit: leaving) |
20:18
🔗
|
|
mutoso has joined #archiveteam |
20:37
🔗
|
|
dxrt has quit IRC (Read error: Operation timed out) |
20:38
🔗
|
|
jspiros has quit IRC (Read error: Operation timed out) |
20:41
🔗
|
|
dxrt has joined #archiveteam |
20:45
🔗
|
|
Kitaru has joined #archiveteam |
21:12
🔗
|
FalconK |
wait, what? what back and forth is there for archivebot in the middle of a crawl? |
21:13
🔗
|
FalconK |
what I see is more like the site being crawled throttles large amounts of requests from a single IP. |
21:13
🔗
|
FalconK |
and the deduplication lookups take a significant amount of cycles, which can't be parallelized because of python |
21:15
🔗
|
Frogging |
FalconK: it's because it has to synchronously (!) contact the controller for every request |
21:16
🔗
|
FalconK |
it probably doesn't have to do that |
21:16
🔗
|
FalconK |
I bet we could make that better |
21:16
🔗
|
FalconK |
it's managing the crawl locally, so what it exchanges with the controller can't be more than status updates |
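If the controller only needs status updates, one way to avoid blocking the crawl on each of them is to queue the updates and send them from a background thread. This is a sketch of that general pattern, not ArchiveBot's actual code; `StatusReporter` and the `send` callable are hypothetical names.

```python
import queue
import threading


class StatusReporter:
    """Hypothetical helper: send status updates without blocking the crawl."""

    _STOP = object()  # private sentinel that tells the worker thread to exit

    def __init__(self, send, max_pending=1000):
        self._send = send  # callable that delivers one update to the controller
        self._queue = queue.Queue(maxsize=max_pending)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def report(self, update):
        """Queue an update; return False (and drop it) if the queue is full."""
        try:
            self._queue.put_nowait(update)
            return True
        except queue.Full:
            return False

    def _run(self):
        while True:
            update = self._queue.get()
            if update is self._STOP:
                break
            try:
                self._send(update)
            except Exception:
                pass  # a failed status update should not stop the crawl

    def close(self):
        """Flush the updates queued so far and stop the worker thread."""
        self._queue.put(self._STOP)
        self._thread.join()
```

The crawl loop would call `report(...)` after each request and carry on immediately. Dropping updates when the queue is full is a design choice made here for simplicity; a real system might prefer to coalesce them instead.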
21:17
🔗
|
FalconK |
the code is unfortunately complicated though |
21:17
🔗
|
DoomTay |
I think I remember someone, probably Asparagir, saying that the only thing stopping rollout of any big updates is the high number of long-term jobs underway |
21:18
🔗
|
FalconK |
hmmmmmmmmmmmmmmmmmm |
21:18
🔗
|
FalconK |
complex. |
21:18
🔗
|
FalconK |
that is a much bigger problem than just serviceability |
21:18
🔗
|
FalconK |
so a couple days ago the PDU on ananiel died |
21:18
🔗
|
FalconK |
I am told there was a great popping, and crackling, and smoke issued |
21:19
🔗
|
FalconK |
and with it, about 14 long-running jobs |
21:19
🔗
|
FalconK |
now they are started from the beginning |
21:19
🔗
|
FalconK |
on the bright side, ananiel was updated before taking on new work |
21:20
🔗
|
FalconK |
but if you set a stopfile, you can expect it to halt sometime between a few hours and 4 months from then (and it can't service small jobs while waiting for the long jobs to go away) |
21:21
🔗
|
FalconK |
initially, one would suppose that the trick is to make it into a distributed workload |
21:21
🔗
|
|
bzc6p has joined #archiveteam |
21:21
🔗
|
|
swebb sets mode: +o bzc6p |
21:21
🔗
|
* |
FalconK ! |
21:21
🔗
|
FalconK |
bzc6p! you're the one with vt.idiota.hu, yes? |
21:22
🔗
|
bzc6p |
um, yes |
21:22
🔗
|
|
robink has quit IRC (Ping timeout: 633 seconds) |
21:22
🔗
|
bzc6p |
why? |
21:22
🔗
|
HCross |
im still up for running a pipeline |
21:22
🔗
|
FalconK |
I'm trying to figure out why you use it for url lists, and not like pastebin |
21:22
🔗
|
FalconK |
and then remove the A record for it after the job starts but before it is completed |
21:22
🔗
|
DoomTay |
I think he lost the job and can't remember the exact URL |
21:23
🔗
|
FalconK |
there have been a few of those jobs |
21:23
🔗
|
FalconK |
and sometimes the thing running them dies a painful death due to power failures, the epoll_wait bug in wpull, or something else |
21:23
🔗
|
FalconK |
and then the next pipeline that takes up the job just tries infinitely to pull the URL list, and fails because the A record is gone and there's nothing there anymore |
21:24
🔗
|
FalconK |
and the job takes up a job slot and keeps reporting infinitely |
21:24
🔗
|
DoomTay |
I'm not sure the A record disappearing is his fault. Right now the site as a whole can't be reached. |
21:24
🔗
|
FalconK |
when I get one of those jobs and notice this condition, the only way to fix it is to add a hosts entry for vt.idiota.hu so it goes to like google.com, so it can "finish" the job |
21:24
🔗
|
FalconK |
yeah |
21:25
🔗
|
FalconK |
... wait, is it URL lists, or were we trying to archive the site? |
21:25
🔗
|
FalconK |
because it looked like URL lists |
21:25
🔗
|
bzc6p |
Stop. |
21:25
🔗
|
bzc6p |
So. |
21:26
🔗
|
bzc6p |
The list resides on my computer, I start a webserver and announce my IP to a dyndns site. |
21:26
🔗
|
bzc6p |
I wait until it finishes and then shut down my server. |
21:26
🔗
|
|
nwf_ has quit IRC (Read error: Operation timed out) |
21:26
🔗
|
bzc6p |
It happened twice out of like 14 cases that there was an outage meanwhile and the task got stuck. |
21:27
🔗
|
bzc6p |
Let me remark that it is a bug in the ArchiveBot software, which was also admitted by yipdw. |
21:27
🔗
|
bzc6p |
I do it this way because I don't want to use third parties unless necessary. |
21:27
🔗
|
FalconK |
yes, you're triggering a bug |
21:27
🔗
|
* |
FalconK nods |
21:28
🔗
|
bzc6p |
But if you think it is a considerable problem that this sometimes happens, I may consider doing it in a way that makes this bug much less likely to happen. |
21:28
🔗
|
FalconK |
it's probably worth fixing the bug, since that is not the only case that can trigger it, but |
21:28
🔗
|
|
nwf_ has joined #archiveteam |
21:28
🔗
|
FalconK |
meanwhile, I know I've had to clear it on my pipeline ~5 times, and when it happens on a pipeline that is less watched, it takes quite a while |
21:29
🔗
|
FalconK |
on the other hand, the next time you bring the host up, if it stays that way for longer than the DNS failure is cached, that pipeline will expunge the job |
21:29
🔗
|
FalconK |
but, this method also risks keeping the DNS failure cached on the same pipeline that takes up your new job |
21:29
🔗
|
FalconK |
in which case both jobs will be stuck |
21:30
🔗
|
* |
FalconK shrugs |
21:30
🔗
|
bzc6p |
Please tell me more about how your pipeline had to be cleared 5 times while I think I've never (or only once) targeted a job to your pipeline |
21:30
🔗
|
bzc6p |
but in #archiveteam-bs |
21:30
🔗
|
FalconK |
you're right, what's really needed is a limit on DNS retries |
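A sketch of what a limit on DNS retries could look like, assuming the pipeline resolves the URL list's host before fetching it. `resolve_with_limit`, `UrlListUnavailable` and all parameters are hypothetical and not part of ArchiveBot or wpull; only `socket.gethostbyname` and `socket.gaierror` are real standard-library names.

```python
import socket
import time


class UrlListUnavailable(Exception):
    """Raised when the URL list's host cannot be resolved within the budget."""


def resolve_with_limit(host, max_attempts=5, delay=1.0,
                       resolver=socket.gethostbyname, sleep=time.sleep):
    """Resolve `host`, giving up after `max_attempts` failed lookups."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return resolver(host)
        except socket.gaierror as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(delay * attempt)  # simple linear backoff between tries
    # Fail the job instead of retrying forever and holding a job slot.
    raise UrlListUnavailable(
        f"{host}: gave up after {max_attempts} attempts"
    ) from last_error
```

A caller that catches `UrlListUnavailable` could mark the job as failed and free its slot, rather than reporting indefinitely.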
21:30
🔗
|
bzc6p |
(I use Frogging, now Luckolors pipelines) |
21:30
🔗
|
FalconK |
it's not a big deal. I just wanted to understand what it was you were trying to do. |
21:32
🔗
|
FalconK |
what's probably happening is that the job gets taken up by ananiel after the pipeline it was on dies (because ananiel is so big that condition is probable, though once I noticed it on Cadbury too) |
21:32
🔗
|
FalconK |
which guarantees the bug is triggered. |
21:32
🔗
|
FalconK |
so thanks for explaining! perhaps this is the next thing I should attack. |
21:33
🔗
|
* |
bzc6p is a bit confused |
21:33
🔗
|
bzc6p |
I think I'll just upload it to some drop site and shit |
21:33
🔗
|
bzc6p |
then it won't happen and won't trigger problems |
21:34
🔗
|
FalconK |
that is probably easier, at any rate |
21:34
🔗
|
* |
bzc6p doesn't like using third parties as our main goal is saving shit from third parties |
21:34
🔗
|
FalconK |
the bug is still a bug that needs fixing |
21:34
🔗
|
FalconK |
well the URL list is hopefully ephemeral and wherever you put it is probably unrelated to the crawl target |
21:35
🔗
|
|
bzc6p sets mode: +o FalconK |
21:35
🔗
|
FalconK |
ty |
21:36
🔗
|
bzc6p |
------------------------- |
21:36
🔗
|
FalconK |
it is an interesting point, though, that someone would be fairly easily able to DoS archivebot with these jobs |
21:36
🔗
|
* |
FalconK goes back to writing |
21:36
🔗
|
bzc6p |
In fact, I came here to inform you that 8086.net doesn't give a crap about us saving their stuff |
21:36
🔗
|
HCross |
#archiveteam-bs |
21:36
🔗
|
bzc6p |
https://secure.8086.net/portal/viewticket.php?tid=NTM-405143&c=9JIYc46J |
21:37
🔗
|
bzc6p |
or, in fact, they do: they instantly activated CloudFlare, but don't mind deleting everything |
21:38
🔗
|
bzc6p |
You have heard ArchiveTeam News |
21:39
🔗
|
bzc6p |
"There is no way you can archive the >billion pages on the site and trying to do so is causing issues for other users on the site." |
21:40
🔗
|
bzc6p |
I can hear SketchCow saying "yeah, deleting everything will also cause issues for other users of the site" |
21:40
🔗
|
DoomTay |
Why did Timothy post the same thing several times? |
21:42
🔗
|
bzc6p |
After wumpus reported he hadn't received a reply, Timothy decided to send the letter to three contact addresses, and after one and a half days without a reply, he decided to send it to the uppercase-initial email addresses too, in case their shitty mailserver is case-sensitive. That's 6 times altogether. |
21:45
🔗
|
|
jspiros has joined #archiveteam |
21:56
🔗
|
DoomTay |
How much has been saved of DNSHistory before then anyway? |
21:57
🔗
|
bzc6p |
0% |
21:58
🔗
|
DoomTay |
Ow |
21:59
🔗
|
bzc6p |
In fact, nearly 0.02% |
21:59
🔗
|
DoomTay |
And now he's deliberately blocking our efforts to get more... |
21:59
🔗
|
DoomTay |
That's just evil |
22:06
🔗
|
SketchCow |
Where's my hug |
22:07
🔗
|
* |
JW_work points to the giant hug monster gathering dust in the corner |
22:10
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:12
🔗
|
|
TC01_ is now known as TC01 |
22:12
🔗
|
SketchCow |
Who hosed it off last |
22:12
🔗
|
|
Lune has joined #archiveteam |
22:13
🔗
|
|
dashcloud has joined #archiveteam |
22:16
🔗
|
* |
JW_work points to the giant hoses coming out of it, used for self-washing |
22:17
🔗
|
FalconK |
I wonder what makes a person do that, enable cloudflare anti-flood for a site they plan to shut down in 2 days due to funding. |
22:17
🔗
|
FalconK |
perhaps they didn't like paying the CDN charges for our crawl. maybe it was more bandwidth than they normally ever see. |
22:18
🔗
|
JW_work |
that seems likely, yeah |
22:23
🔗
|
|
bzc6p has left |
22:26
🔗
|
Lune |
lol are these the dns people? |
22:27
🔗
|
ranma |
is it on ATW? |
22:28
🔗
|
joepie91 |
you can still reply to the ticket, by the way |
22:28
🔗
|
joepie91 |
even if it is closed |
22:29
🔗
|
joepie91 |
[00:17] <FalconK> perhaps they didn't like paying the CDN charges for our crawl. maybe it was more bandwidth than they normally ever see. |
22:29
🔗
|
joepie91 |
I very strongly doubt there's a CDN involved here |
22:41
🔗
|
FalconK |
well cloudflare |
22:44
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
22:47
🔗
|
|
aschmitz_ has quit IRC (Read error: Operation timed out) |
22:48
🔗
|
|
aschmitz_ has joined #archiveteam |
23:12
🔗
|
joepie91 |
you don't pay cloudflare for bandwidth |
23:12
🔗
|
joepie91 |
that's not how their business model works |
23:12
🔗
|
joepie91 |
so that's a non-argument :P |
23:16
🔗
|
|
K4k has quit IRC (Quit: WeeChat 1.5) |
23:16
🔗
|
|
K4k has joined #archiveteam |
23:19
🔗
|
|
robink has joined #archiveteam |
23:21
🔗
|
arkiver |
I'm able to write some scripts for a little project for examiner.com tomorrow. But I'm not sure if we have enough time to save everything |
23:23
🔗
|
DoomTay |
Wait, what makes you think it's in danger? |
23:26
🔗
|
Kitaru |
DoomTay: they announced it's closing on the 10th |
23:27
🔗
|
DoomTay |
Oh... |
23:39
🔗
|
|
tomwsmf-a has joined #archiveteam |
23:46
🔗
|
|
Start has joined #archiveteam |
23:47
🔗
|
Lune |
bloody french |
23:59
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |