Time |
Nickname |
Message |
00:00
🔗
|
|
cadbury_ has joined #archiveteam |
00:17
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
00:42
🔗
|
|
philpem has quit IRC (Ping timeout: 252 seconds) |
01:03
🔗
|
|
balrog has quit IRC (Read error: Operation timed out) |
01:04
🔗
|
|
mistym has joined #archiveteam |
01:07
🔗
|
|
balrog has joined #archiveteam |
01:07
🔗
|
|
swebb sets mode: +o balrog |
01:13
🔗
|
|
boozehoun has quit IRC (Read error: Operation timed out) |
01:25
🔗
|
|
username1 has joined #archiveteam |
01:27
🔗
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
01:29
🔗
|
|
zenguy_pc has joined #archiveteam |
01:29
🔗
|
|
BlueMaxim has joined #archiveteam |
01:37
🔗
|
|
zenguy_pc has quit IRC (Quit: Leaving) |
01:39
🔗
|
|
primus104 has quit IRC (Leaving.) |
01:40
🔗
|
|
zenguy_pc has joined #archiveteam |
01:46
🔗
|
|
VADemon has quit IRC (Read error: Connection reset by peer) |
01:47
🔗
|
pikhq |
Fun stuff. I found a bug in wget's mirroring logic... |
01:48
🔗
|
pikhq |
It appears it doesn't look at any charset info for any HTML file. Which means if for some reason your website is using UTF-16 (... shockingly I found something that does that), it doesn't work right. |
01:51
🔗
|
pikhq |
Ah, no, that's not quite it. This is a website that is putting out UTF-16 without any indication of the charset, and wget doesn't try to detect the charset heuristically. |
01:52
🔗
|
pikhq |
"Fun". |
01:59
🔗
|
pikhq |
Okay, then. *When* you tell it the remote charset is UTF-16 it still looks for ASCII patterns to try and pick out URLs. |
02:01
🔗
|
pikhq |
Time to find how to report a bug to wget. |
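The failure pikhq describes can be illustrated with a minimal Python sketch. It is not wget's code: the page content and both helper functions are hypothetical, and the only claim is that a byte-level scan for ASCII patterns finds nothing in UTF-16 data, while decoding with the declared charset first does.

```python
import re

# Hypothetical page content; not taken from the site pikhq was mirroring.
html = '<a href="http://example.com/page2.html">next</a>'

# UTF-16 stores every ASCII character as two bytes (one of them NUL),
# so the byte sequence b'href="' never appears in the encoded page.
utf16_bytes = html.encode("utf-16")


def find_hrefs_ascii(raw):
    """Scan raw bytes for href="..." the way an ASCII-only parser would."""
    return re.findall(rb'href="([^"]*)"', raw)


def find_hrefs_decoded(raw, charset):
    """Decode with the declared charset first, then scan the text."""
    return re.findall(r'href="([^"]*)"', raw.decode(charset))


ascii_hits = find_hrefs_ascii(utf16_bytes)
decoded_hits = find_hrefs_decoded(utf16_bytes, "utf-16")

print(ascii_hits)
print(decoded_hits)
```

The first scan comes back empty because of the interleaved NUL bytes; the second recovers the link once the bytes are decoded.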
02:17
🔗
|
|
lytv has quit IRC (Read error: Operation timed out) |
02:20
🔗
|
|
lytv has joined #archiveteam |
02:33
🔗
|
|
X-Scale` has joined #archiveteam |
02:37
🔗
|
|
X-Scale has quit IRC (Ping timeout: 506 seconds) |
02:37
🔗
|
|
Stiletto has joined #archiveteam |
03:04
🔗
|
|
db48x has joined #archiveteam |
03:06
🔗
|
|
Ravenloft has joined #archiveteam |
04:06
🔗
|
|
SN4T14__ has joined #archiveteam |
04:11
🔗
|
|
SN4T14_ has quit IRC (Ping timeout: 306 seconds) |
04:13
🔗
|
DFJustin |
yeah utf-16 has never worked with wget I think, try wpull |
04:15
🔗
|
pikhq |
Still a bug. :) |
04:17
🔗
|
* |
closure has filed bugs on both wget and curl this year. they fixed the curl one. wget one can get it to delete a file that it's not supposed to touch.. |
04:18
🔗
|
closure |
got curl to behave sensibly when downloading empty files, at last :) |
04:18
🔗
|
pikhq |
DFJustin: From the sounds of it wpull looks rather a lot nicer for some stuff. |
04:24
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
04:41
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
04:44
🔗
|
|
mistym has joined #archiveteam |
05:06
🔗
|
|
underscor has quit IRC (Ping timeout: 370 seconds) |
05:28
🔗
|
|
underscor has joined #archiveteam |
05:28
🔗
|
|
swebb sets mode: +o underscor |
06:08
🔗
|
|
PepsiMax_ is now known as PepsiMax |
06:50
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
06:50
🔗
|
|
puddle has quit IRC (Quit: Connection closed for inactivity) |
06:52
🔗
|
|
nertzy2 has quit IRC (Quit: This computer has gone to sleep) |
06:56
🔗
|
|
hlndr has quit IRC (Read error: Operation timed out) |
07:00
🔗
|
|
hlndr has joined #archiveteam |
07:00
🔗
|
|
garyrh has quit IRC (http://bnc4free.com/) |
07:01
🔗
|
|
garyrh has joined #archiveteam |
07:02
🔗
|
|
primus104 has joined #archiveteam |
07:15
🔗
|
|
yipdw has quit IRC (Remote host closed the connection) |
07:15
🔗
|
|
yipdw has joined #archiveteam |
07:43
🔗
|
|
primus104 has quit IRC (Leaving.) |
07:53
🔗
|
|
atomotic has joined #archiveteam |
07:55
🔗
|
|
hlndr has quit IRC (Read error: Operation timed out) |
07:55
🔗
|
|
philpem has joined #archiveteam |
08:13
🔗
|
|
MMovie1 has joined #archiveteam |
08:16
🔗
|
|
MMovie has quit IRC (Ping timeout: 306 seconds) |
08:30
🔗
|
|
primus104 has joined #archiveteam |
08:43
🔗
|
|
schbirid2 has joined #archiveteam |
08:45
🔗
|
|
username1 has quit IRC (Read error: Operation timed out) |
09:03
🔗
|
|
X-Scale` is now known as X-Scale |
09:55
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
09:58
🔗
|
|
dashcloud has joined #archiveteam |
10:00
🔗
|
ersi |
closure: The curl people are nice and pretty reasonable~ |
10:17
🔗
|
|
wwwtxt has joined #archiveteam |
10:32
🔗
|
|
wwwtxt has quit IRC (Client Quit) |
10:33
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
10:50
🔗
|
|
john1 has quit IRC (Read error: Operation timed out) |
10:52
🔗
|
|
hlndr has joined #archiveteam |
10:58
🔗
|
|
hlndr has quit IRC (Ping timeout: 306 seconds) |
11:02
🔗
|
|
john1 has joined #archiveteam |
11:18
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
11:22
🔗
|
|
dashcloud has joined #archiveteam |
11:31
🔗
|
|
Ymgve has joined #archiveteam |
11:45
🔗
|
|
atomotic has joined #archiveteam |
12:21
🔗
|
|
primus104 has quit IRC (Leaving.) |
12:27
🔗
|
|
xtr-107 has quit IRC (Read error: Connection reset by peer) |
12:29
🔗
|
|
xtr-201 has joined #archiveteam |
12:30
🔗
|
|
username1 has joined #archiveteam |
12:32
🔗
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
12:41
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
12:46
🔗
|
|
bzc6p has joined #archiveteam |
12:48
🔗
|
|
signius has quit IRC (Read error: Operation timed out) |
12:49
🔗
|
|
nertzy2 has joined #archiveteam |
12:50
🔗
|
bzc6p |
pikhq: as far as I know this is the official wget bugtracking site: http://savannah.gnu.org/bugs/?group=wget |
12:50
🔗
|
bzc6p |
Reading your lines, I wonder if a bug that I filed is related to this one |
12:51
🔗
|
bzc6p |
http://savannah.gnu.org/bugs/?42794 |
12:51
🔗
|
bzc6p |
The encoding is UTF-8 but we still couldn't find the logic in the bug. |
12:53
🔗
|
bzc6p |
That was the point (quite early on) at which I switched to wpull for website archival. (I also miss the --retry-dns-error option in wget, which is crucial for me as I don't have a stable connection) |
12:55
🔗
|
|
mistym has joined #archiveteam |
12:55
🔗
|
|
nertzy2 has quit IRC (Read error: Operation timed out) |
12:59
🔗
|
|
bzc6p has quit IRC (bzc6p) |
13:02
🔗
|
|
mistym has quit IRC (Read error: Operation timed out) |
13:03
🔗
|
|
signius has joined #archiveteam |
13:36
🔗
|
|
Morbus has quit IRC (Quit: http://www.disobey.com/) |
13:55
🔗
|
|
mistym has joined #archiveteam |
13:59
🔗
|
|
Morbus has joined #archiveteam |
14:00
🔗
|
|
Start has quit IRC (Disconnected.) |
14:02
🔗
|
|
mistym has quit IRC (Read error: Operation timed out) |
14:11
🔗
|
|
Ravenloft has quit IRC (Read error: Connection reset by peer) |
14:17
🔗
|
|
sankin has joined #archiveteam |
14:30
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
14:31
🔗
|
|
garyrh has quit IRC (Ping timeout: 619 seconds) |
14:33
🔗
|
|
dashcloud has joined #archiveteam |
14:34
🔗
|
|
useretail has quit IRC (Ping timeout: 619 seconds) |
14:37
🔗
|
|
Start has joined #archiveteam |
14:38
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
14:41
🔗
|
|
yipdw has quit IRC (Ping timeout: 255 seconds) |
14:41
🔗
|
|
swebb has quit IRC (Ping timeout: 255 seconds) |
14:41
🔗
|
|
midas has quit IRC (Ping timeout: 255 seconds) |
14:41
🔗
|
|
swebb has joined #archiveteam |
14:43
🔗
|
|
yipdw has joined #archiveteam |
14:44
🔗
|
|
dashcloud has quit IRC (Ping timeout: 483 seconds) |
14:46
🔗
|
|
midas has joined #archiveteam |
14:52
🔗
|
|
dashcloud has joined #archiveteam |
14:55
🔗
|
|
Start has quit IRC (Disconnected.) |
14:57
🔗
|
|
useretail has joined #archiveteam |
14:58
🔗
|
|
Start has joined #archiveteam |
14:59
🔗
|
|
mistym has joined #archiveteam |
15:14
🔗
|
|
primus104 has joined #archiveteam |
15:19
🔗
|
|
primus105 has joined #archiveteam |
15:22
🔗
|
|
primus105 has quit IRC (Client Quit) |
15:25
🔗
|
DFJustin |
someone should probably grab this stuff, archivebot isn't working on the site http://www.dni.gov/index.php/resources/bin-laden-bookshelf |
15:27
🔗
|
|
primus104 has quit IRC (Read error: Operation timed out) |
15:36
🔗
|
|
Emcy has quit IRC (Read error: Connection reset by peer) |
15:38
🔗
|
|
Jonimus has quit IRC (Ping timeout: 370 seconds) |
15:51
🔗
|
|
Start has quit IRC (Disconnected.) |
15:52
🔗
|
|
SmileyG has joined #archiveteam |
15:53
🔗
|
|
tephra_ has joined #archiveteam |
15:53
🔗
|
|
Quile_ has joined #archiveteam |
15:56
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
15:56
🔗
|
|
thechip_ has joined #archiveteam |
16:01
🔗
|
|
tephra has quit IRC (hub.se irc.underworld.no) |
16:01
🔗
|
|
Smiley has quit IRC (hub.se irc.underworld.no) |
16:01
🔗
|
|
tsp_ has quit IRC (hub.se irc.underworld.no) |
16:01
🔗
|
|
thechip has quit IRC (hub.se irc.underworld.no) |
16:01
🔗
|
|
wm_ has quit IRC (hub.se irc.underworld.no) |
16:01
🔗
|
|
dugo has quit IRC (hub.se irc.underworld.no) |
16:01
🔗
|
|
Marc has quit IRC (hub.se irc.underworld.no) |
16:01
🔗
|
|
raylee has quit IRC (hub.se irc.underworld.no) |
16:01
🔗
|
|
Quile has quit IRC (hub.se irc.underworld.no) |
16:01
🔗
|
|
Atluxity has quit IRC (hub.se irc.underworld.no) |
16:10
🔗
|
|
dugo_ has joined #archiveteam |
16:10
🔗
|
|
mistym has joined #archiveteam |
16:24
🔗
|
|
philpem has quit IRC (Remote host closed the connection) |
16:26
🔗
|
|
Emcy has joined #archiveteam |
16:29
🔗
|
|
Ravenloft has joined #archiveteam |
16:31
🔗
|
|
SimpBrain has joined #archiveteam |
16:31
🔗
|
|
twrist has joined #archiveteam |
16:35
🔗
|
|
garyrh has joined #archiveteam |
16:35
🔗
|
|
primus104 has joined #archiveteam |
16:41
🔗
|
|
aaaaaaaaa has joined #archiveteam |
16:52
🔗
|
|
Start has joined #archiveteam |
17:42
🔗
|
|
Start has quit IRC (Disconnected.) |
17:49
🔗
|
|
aNthraXx_ has joined #archiveteam |
17:52
🔗
|
|
aNthraXx has quit IRC (Read error: Operation timed out) |
17:53
🔗
|
|
cadbury_ has quit IRC (Ping timeout: 606 seconds) |
17:53
🔗
|
|
brayden_ has quit IRC (Ping timeout: 606 seconds) |
17:56
🔗
|
|
caber has quit IRC (Ping timeout: 606 seconds) |
17:59
🔗
|
|
aNthraXx_ has quit IRC (Read error: Operation timed out) |
17:59
🔗
|
|
caber has joined #archiveteam |
18:00
🔗
|
|
aNthraXx has joined #archiveteam |
18:04
🔗
|
|
cadbury_ has joined #archiveteam |
18:10
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
18:13
🔗
|
|
habi has joined #archiveteam |
18:14
🔗
|
|
dashcloud has joined #archiveteam |
18:17
🔗
|
|
raylee has joined #archiveteam |
18:17
🔗
|
|
wm_ has joined #archiveteam |
18:22
🔗
|
|
Emcy_ has joined #archiveteam |
18:29
🔗
|
|
Emcy has quit IRC (Ping timeout: 512 seconds) |
18:33
🔗
|
|
hlndr has joined #archiveteam |
18:33
🔗
|
|
twrist has quit IRC (And now, for my next magic trick..) |
18:37
🔗
|
|
primus104 has quit IRC (Leaving.) |
18:53
🔗
|
|
sankin has quit IRC (Leaving.) |
19:00
🔗
|
|
Emcy_ has quit IRC (Read error: Connection reset by peer) |
19:05
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
19:10
🔗
|
|
primus104 has joined #archiveteam |
19:20
🔗
|
|
mistym has joined #archiveteam |
19:43
🔗
|
username1 |
https://github.com/venomous0x/WhatsAPI |
19:54
🔗
|
yipdw |
I have a copy |
19:54
🔗
|
yipdw |
https://github.com/yipdw/WhatsAPI/commits/master |
19:55
🔗
|
yipdw |
as do 2,467 others |
19:55
🔗
|
yipdw |
er sorry 1,921 others |
19:55
🔗
|
yipdw |
that said, fuck WhatsApp |
19:58
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
20:08
🔗
|
db48x |
cloned |
20:11
🔗
|
balrog |
yipdw: is that up to date? |
20:13
🔗
|
|
mistym has joined #archiveteam |
20:15
🔗
|
yipdw |
balrog: probably not |
20:15
🔗
|
|
mistym has quit IRC (Read error: Connection reset by peer) |
20:15
🔗
|
yipdw |
you'll want to comb the other 1,921 clones to check |
20:16
🔗
|
|
mistym has joined #archiveteam |
20:16
🔗
|
db48x |
https://github.com/15786548135/WhatsAPI/commits/master |
20:17
🔗
|
db48x |
https://github.com/7aduta/WhatsAPI/commits/master |
20:23
🔗
|
|
habi has left |
20:26
🔗
|
db48x |
Array.prototype.map.call(document.querySelectorAll("div.repo>a:nth-of-type(2)"), function (e) { return "git remote add "+ (e.href.match(/com\/([^\/]*)\//)[1]) +" "+ e.href +".git"; }); |
20:27
🔗
|
|
human39_ has joined #archiveteam |
20:28
🔗
|
db48x |
document.documentElement.innerHTML = Array.prototype.map.call(document.querySelectorAll("div.repo>a:nth-of-type(2)"), function (e) { return "git remote add "+ (e.href.match(/com\/([^\/]*)\//)[1]) +" "+ e.href +".git"; }).join("<br>") |
20:28
🔗
|
db48x |
it's only 1k of the 1.9k remotes though |
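The same idea can be sketched outside the browser. This is a hedged Python version, assuming the fork URLs have already been collected into a list (here, the three forks linked earlier in the log). The function name and the choice to use the fork owner as the remote name are illustrative, and `git remote add <name> <url>` is the argument order git expects.

```python
import re


def remote_add_command(fork_url):
    """Build a `git remote add` command that names the remote after the fork owner."""
    match = re.search(r"github\.com/([^/]+)/([^/]+?)(?:\.git)?/?$", fork_url)
    if match is None:
        raise ValueError("not a GitHub repository URL: " + fork_url)
    owner, repo = match.groups()
    return f"git remote add {owner} https://github.com/{owner}/{repo}.git"


# Fork URLs mentioned in the channel; a full run would need all ~1.9k forks.
forks = [
    "https://github.com/yipdw/WhatsAPI",
    "https://github.com/15786548135/WhatsAPI",
    "https://github.com/7aduta/WhatsAPI",
]

commands = [remote_add_command(url) for url in forks]
print("\n".join(commands))
```

Running the printed commands inside one clone and then `git fetch --all` would pull every fork's history into a single repository.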
20:29
🔗
|
|
Start has joined #archiveteam |
20:32
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
20:48
🔗
|
|
mistym has joined #archiveteam |
20:50
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
20:53
🔗
|
|
kyan has joined #archiveteam |
20:59
🔗
|
db48x |
my net connection is acting up... |
21:03
🔗
|
|
mistym has joined #archiveteam |
21:04
🔗
|
|
Rickster has quit IRC (Quit: ZNC - http://znc.in) |
21:08
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
21:09
🔗
|
|
Rickster has joined #archiveteam |
21:14
🔗
|
|
mistym has joined #archiveteam |
21:16
🔗
|
|
Start has quit IRC (Disconnected.) |
21:33
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
21:35
🔗
|
|
BlueMaxim has joined #archiveteam |
21:44
🔗
|
|
mistym has joined #archiveteam |
22:01
🔗
|
db48x |
interesting, the api only gives me 1822 |
22:03
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
22:05
🔗
|
|
mistym has joined #archiveteam |
22:08
🔗
|
|
n00b169 has joined #archiveteam |
22:10
🔗
|
|
n00b169 has quit IRC (Client Quit) |
22:13
🔗
|
|
yuvadm has joined #archiveteam |
22:14
🔗
|
yuvadm |
looking for some advice on frameworks i can use to scrape the hell out of a blogging platform that's going down |
22:14
🔗
|
yuvadm |
before i start NIH'ing some code |
22:21
🔗
|
|
toad1 has joined #archiveteam |
22:22
🔗
|
xmc |
knights of nih |
22:22
🔗
|
|
toad2 has quit IRC (Read error: Operation timed out) |
22:29
🔗
|
|
rumbles has joined #archiveteam |
22:29
🔗
|
yuvadm |
XD |
22:29
🔗
|
rumbles |
@yipdw does archivebot support parsing a json payload of urls for processing? |
22:29
🔗
|
rumbles |
Url in question: https://api.github.com/repos/venomous0x/WhatsAPI/pulls?state=open |
22:29
🔗
|
|
Emcy has joined #archiveteam |
22:30
🔗
|
yuvadm |
heh, nice |
22:33
🔗
|
db48x |
yuvadm: wpull |
22:34
🔗
|
db48x |
rumbles: every pull request has an associated git ref |
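As a rough illustration of db48x's point: GitHub publishes the head of each pull request under `refs/pull/<number>/head`, so pull requests can be fetched with plain git. The sketch below only builds the command strings; the helper names, the remote name `origin`, the local `pr/` namespace, and the pull request number are all assumptions for the example.

```python
def pull_request_ref(number):
    """Ref under which GitHub exposes the head commit of a pull request."""
    return f"refs/pull/{number}/head"


def fetch_command(remote, number):
    """Command that fetches one pull request into a local tracking ref."""
    ref = pull_request_ref(number)
    return f"git fetch {remote} {ref}:refs/remotes/{remote}/pr/{number}"


# Wildcard refspec that fetches every pull request head in one go.
ALL_PULLS_REFSPEC = "+refs/pull/*/head:refs/remotes/origin/pr/*"

# The pull request number 1 is a placeholder, not a specific PR from the log.
example = fetch_command("origin", 1)
print(example)
print(f"git fetch origin '{ALL_PULLS_REFSPEC}'")
```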
22:34
🔗
|
yuvadm |
db48x: bless you, exactly what i need |
22:34
🔗
|
yuvadm |
py<3 |
22:34
🔗
|
|
nertzy has joined #archiveteam |
22:35
🔗
|
db48x |
you're welcome |
22:35
🔗
|
DFJustin |
yuvadm: what's the blogging platform, maybe it should be an archive team project |
22:35
🔗
|
rumbles |
@db48x: thanks! |
22:36
🔗
|
yuvadm |
DFJustin: i'd love for that to happen, but there's an i18n barrier, it's all in hebrew |
22:36
🔗
|
yuvadm |
israblog.co.il |
22:36
🔗
|
db48x |
you're welcome |
22:36
🔗
|
yuvadm |
largest israeli blogging platform since way back |
22:37
🔗
|
db48x |
yuvadm: sounds like a good candidate for the warrior then |
22:37
🔗
|
db48x |
http://tracker.archiveteam.org/ |
22:37
🔗
|
yuvadm |
why's warrior the better option in this case? |
22:37
🔗
|
db48x |
it's distributed, see http://tracker.archiveteam.org/furaffinity/ for an example |
22:38
🔗
|
rumbles |
distributed = less likely for extraction to be banned/throttled |
22:38
🔗
|
yuvadm |
what's the input for warrior? a WARC? |
22:38
🔗
|
db48x |
a list of the tasks to do |
22:39
🔗
|
db48x |
profile:kaafan33 and submission:15872951-15873000, for example |
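To make the item format concrete, here is a small Python sketch that splits names like the two above into a type and a payload. It is not taken from the furaffinity-grab pipeline: the function name, the returned dictionary shape, and the assumption that a `submission` range is inclusive at both ends are all illustrative.

```python
def parse_item(item_name):
    """Split an item name such as 'type:value' into its parts."""
    item_type, _, value = item_name.partition(":")
    if not value:
        raise ValueError("expected 'type:value', got " + repr(item_name))

    if item_type == "submission":
        # Assumed: 'first-last' is an inclusive range of numeric IDs.
        start, _, end = value.partition("-")
        first = int(start)
        last = int(end) if end else first
        return {"type": item_type, "ids": range(first, last + 1)}

    # Any other type (for example 'profile') keeps its value as a string.
    return {"type": item_type, "value": value}


profile = parse_item("profile:kaafan33")
submissions = parse_item("submission:15872951-15873000")

print(profile)
print(len(submissions["ids"]), "submission IDs in this item")
```

Batching a range of IDs into one item keeps the number of tracker tasks manageable while still letting many warriors work in parallel.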
22:39
🔗
|
yuvadm |
cool |
22:39
🔗
|
yuvadm |
i'll take a look, see if it fits the bill. who authorizes tasks for the warrior? |
22:40
🔗
|
yuvadm |
this is a pretty large project |
22:40
🔗
|
db48x |
each site we're working on has a separate git repository where the source for the pipeline is kept |
22:40
🔗
|
|
Start has joined #archiveteam |
22:40
🔗
|
db48x |
https://github.com/ArchiveTeam/furaffinity-grab |
22:41
🔗
|
db48x |
looking at this one I see that it actually uses wpull |
22:41
🔗
|
db48x |
https://github.com/ArchiveTeam/furaffinity-grab/blob/master2/pipeline.py#L193 |
22:41
🔗
|
yuvadm |
db48x: that's awesome. gotta go afk, but i'll be back with more Q's for sure |
22:42
🔗
|
rumbles |
db48x would you accept a PR for a Dockerfile to build pipelines if I built one? |
22:42
🔗
|
db48x |
you can see how it looks at the job ID to decide what to do; https://github.com/ArchiveTeam/furaffinity-grab/blob/master2/pipeline.py#L226 |
22:42
🔗
|
db48x |
yuvadm: sure, I'll be in and out as well |
22:42
🔗
|
db48x |
rumbles: possibly |
22:44
🔗
|
rumbles |
thanks! |
22:58
🔗
|
|
rumbles has quit IRC (Quit: Page closed) |
23:03
🔗
|
|
REiN^ has quit IRC (Read error: Operation timed out) |
23:04
🔗
|
|
REiN^ has joined #archiveteam |
23:05
🔗
|
|
cadbury_ has quit IRC (Read error: Operation timed out) |
23:06
🔗
|
|
dinomite_ has joined #archiveteam |
23:07
🔗
|
|
Jonimus has joined #archiveteam |
23:08
🔗
|
|
dinomite has quit IRC (Read error: Connection reset by peer) |
23:08
🔗
|
yipdw |
rumbles: no, but curl -s 'https://api.github.com/repos/venomous0x/WhatsAPI/pulls?state=open' | jq '.[].url' > FILE works |
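For anyone without jq, the filter `.[].url` can be approximated in Python. The sketch below works on an inline sample instead of a live request. The sample payload, its pull request numbers, and the function name are made up for illustration, and only the `url` field is assumed to be present in each element.

```python
import json

# Made-up sample shaped like a list of pull request objects; the numbers
# are placeholders, not real pull requests from the repository.
sample_payload = """
[
  {"number": 101, "url": "https://api.github.com/repos/venomous0x/WhatsAPI/pulls/101"},
  {"number": 102, "url": "https://api.github.com/repos/venomous0x/WhatsAPI/pulls/102"}
]
"""


def extract_urls(payload):
    """Return the 'url' field of every object in a JSON array."""
    return [item["url"] for item in json.loads(payload)]


urls = extract_urls(sample_payload)

# One URL per line, ready to save to a file and hand to a crawler.
print("\n".join(urls))
```

Note that jq prints each string with surrounding quotes unless it is run with `-r`; the Python version emits the bare URLs.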
23:11
🔗
|
|
aNthraXx has quit IRC (Read error: No route to host) |
23:13
🔗
|
aaaaaaaaa |
he left, so he may ask again later |
23:14
🔗
|
yipdw |
db48x: oh yeah we have a dockerfile already |
23:14
🔗
|
yipdw |
heh |
23:17
🔗
|
|
Sk1d has quit IRC (Ping timeout: 606 seconds) |
23:18
🔗
|
|
Sk1d has joined #archiveteam |
23:23
🔗
|
|
cadbury_ has joined #archiveteam |
23:23
🔗
|
|
aNthraXx has joined #archiveteam |
23:23
🔗
|
|
REiN^ has quit IRC (Ping timeout: 370 seconds) |
23:27
🔗
|
|
lexicon has joined #archiveteam |
23:35
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |
23:38
🔗
|
|
REiN^ has joined #archiveteam |
23:42
🔗
|
|
Sellyme has quit IRC (No Ping reply in 180 seconds.) |
23:44
🔗
|
|
Sellyme has joined #archiveteam |
23:54
🔗
|
|
SimpBrain has quit IRC (Ping timeout: 258 seconds) |