#archiveteam 2015-05-20,Wed


Time Nickname Message
00:00 🔗 cadbury_ has joined #archiveteam
00:17 🔗 mistym has quit IRC (Remote host closed the connection)
00:42 🔗 philpem has quit IRC (Ping timeout: 252 seconds)
01:03 🔗 balrog has quit IRC (Read error: Operation timed out)
01:04 🔗 mistym has joined #archiveteam
01:07 🔗 balrog has joined #archiveteam
01:07 🔗 swebb sets mode: +o balrog
01:13 🔗 boozehoun has quit IRC (Read error: Operation timed out)
01:25 🔗 username1 has joined #archiveteam
01:27 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
01:29 🔗 zenguy_pc has joined #archiveteam
01:29 🔗 BlueMaxim has joined #archiveteam
01:37 🔗 zenguy_pc has quit IRC (Quit: Leaving)
01:39 🔗 primus104 has quit IRC (Leaving.)
01:40 🔗 zenguy_pc has joined #archiveteam
01:46 🔗 VADemon has quit IRC (Read error: Connection reset by peer)
01:47 🔗 pikhq Fun stuff. I found a bug in wget's mirroring logic...
01:48 🔗 pikhq It appears it doesn't look at any charset info for any HTML file. Which means if for some reason your website is using UTF-16 (... shockingly I found something that does that), it doesn't work right.
01:51 🔗 pikhq Ah, no, that's not quite it. This is a website that is putting out UTF-16 without any indication of the charset, and wget doesn't try to detect the charset heuristically.
01:52 🔗 pikhq "Fun".
01:59 🔗 pikhq Okay, then. *When* you tell it the remote charset is UTF-16 it still looks for ASCII patterns to try and pick out URLs.
02:01 🔗 pikhq Time to find how to report a bug to wget.
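A minimal sketch of the failure mode pikhq describes: scanning UTF-16 bytes for ASCII patterns finds nothing, because every ASCII character is interleaved with a zero byte. This is an illustration only, not wget's actual code; the sample HTML and the regular expression are assumptions made up for the example.

```python
import re

# Hypothetical sample page; example.com is a placeholder, not the site from the log.
html = '<a href="http://example.com/page">link</a>'

# Encode as UTF-16 (BOM plus two bytes per character for this ASCII-only text).
utf16_bytes = html.encode("utf-16")

# An ASCII-oriented scan over the raw bytes: the zero bytes between
# characters mean the literal byte sequence href=" never appears.
ascii_pattern = re.compile(rb'href="([^"]+)"')
raw_matches = ascii_pattern.findall(utf16_bytes)

# Decoding first and then scanning the text finds the link.
decoded = utf16_bytes.decode("utf-16")
text_matches = re.findall(r'href="([^"]+)"', decoded)

print(raw_matches)
print(text_matches)
```

Decoding to text before link extraction is the general fix; whether a given crawler does so is something to check per tool.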
02:17 🔗 lytv has quit IRC (Read error: Operation timed out)
02:20 🔗 lytv has joined #archiveteam
02:33 🔗 X-Scale` has joined #archiveteam
02:37 🔗 X-Scale has quit IRC (Ping timeout: 506 seconds)
02:37 🔗 Stiletto has joined #archiveteam
03:04 🔗 db48x has joined #archiveteam
03:06 🔗 Ravenloft has joined #archiveteam
04:06 🔗 SN4T14__ has joined #archiveteam
04:11 🔗 SN4T14_ has quit IRC (Ping timeout: 306 seconds)
04:13 🔗 DFJustin yeah utf-16 has never worked with wget I think, try wpull
04:15 🔗 pikhq Still a bug. :)
04:17 🔗 * closure has filed bugs on both wget and curl this year. they fixed the curl one. wget one can get it to delete a file that it's not supposed to touch..
04:18 🔗 closure got curl to behave sensibly when downloading empty files, at last :)
04:18 🔗 pikhq DFJustin: From the sounds of it wpull looks rather a lot nicer for some stuff.
04:24 🔗 mistym has quit IRC (Remote host closed the connection)
04:41 🔗 aaaaaaaaa has quit IRC (Leaving)
04:44 🔗 mistym has joined #archiveteam
05:06 🔗 underscor has quit IRC (Ping timeout: 370 seconds)
05:28 🔗 underscor has joined #archiveteam
05:28 🔗 swebb sets mode: +o underscor
06:08 🔗 PepsiMax_ is now known as PepsiMax
06:50 🔗 mistym has quit IRC (Remote host closed the connection)
06:50 🔗 puddle has quit IRC (Quit: Connection closed for inactivity)
06:52 🔗 nertzy2 has quit IRC (Quit: This computer has gone to sleep)
06:56 🔗 hlndr has quit IRC (Read error: Operation timed out)
07:00 🔗 hlndr has joined #archiveteam
07:00 🔗 garyrh has quit IRC (http://bnc4free.com/)
07:01 🔗 garyrh has joined #archiveteam
07:02 🔗 primus104 has joined #archiveteam
07:15 🔗 yipdw has quit IRC (Remote host closed the connection)
07:15 🔗 yipdw has joined #archiveteam
07:43 🔗 primus104 has quit IRC (Leaving.)
07:53 🔗 atomotic has joined #archiveteam
07:55 🔗 hlndr has quit IRC (Read error: Operation timed out)
07:55 🔗 philpem has joined #archiveteam
08:13 🔗 MMovie1 has joined #archiveteam
08:16 🔗 MMovie has quit IRC (Ping timeout: 306 seconds)
08:30 🔗 primus104 has joined #archiveteam
08:43 🔗 schbirid2 has joined #archiveteam
08:45 🔗 username1 has quit IRC (Read error: Operation timed out)
09:03 🔗 X-Scale` is now known as X-Scale
09:55 🔗 dashcloud has quit IRC (Read error: Operation timed out)
09:58 🔗 dashcloud has joined #archiveteam
10:00 🔗 ersi closure: The curl people are nice and pretty reasonable~
10:17 🔗 wwwtxt has joined #archiveteam
10:32 🔗 wwwtxt has quit IRC (Client Quit)
10:33 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
10:50 🔗 john1 has quit IRC (Read error: Operation timed out)
10:52 🔗 hlndr has joined #archiveteam
10:58 🔗 hlndr has quit IRC (Ping timeout: 306 seconds)
11:02 🔗 john1 has joined #archiveteam
11:18 🔗 dashcloud has quit IRC (Read error: Operation timed out)
11:22 🔗 dashcloud has joined #archiveteam
11:31 🔗 Ymgve has joined #archiveteam
11:45 🔗 atomotic has joined #archiveteam
12:21 🔗 primus104 has quit IRC (Leaving.)
12:27 🔗 xtr-107 has quit IRC (Read error: Connection reset by peer)
12:29 🔗 xtr-201 has joined #archiveteam
12:30 🔗 username1 has joined #archiveteam
12:32 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
12:41 🔗 BlueMaxim has quit IRC (Quit: Leaving)
12:46 🔗 bzc6p has joined #archiveteam
12:48 🔗 signius has quit IRC (Read error: Operation timed out)
12:49 🔗 nertzy2 has joined #archiveteam
12:50 🔗 bzc6p pikhq: as far as I know this is the official wget bugtracking site: http://savannah.gnu.org/bugs/?group=wget
12:50 🔗 bzc6p Reading your lines, I wonder if a bug that I filed is related to this one
12:51 🔗 bzc6p http://savannah.gnu.org/bugs/?42794
12:51 🔗 bzc6p The encoding is UTF-8 but we still couldn't find the logic in the bug.
12:53 🔗 bzc6p That was the point (quite early on) when I switched to wpull for website archiving. (I also miss the --retry-dns-error option in wget, which is crucial for me as I don't have a stable connection)
12:55 🔗 mistym has joined #archiveteam
12:55 🔗 nertzy2 has quit IRC (Read error: Operation timed out)
12:59 🔗 bzc6p has quit IRC (bzc6p)
13:02 🔗 mistym has quit IRC (Read error: Operation timed out)
13:03 🔗 signius has joined #archiveteam
13:36 🔗 Morbus has quit IRC (Quit: http://www.disobey.com/)
13:55 🔗 mistym has joined #archiveteam
13:59 🔗 Morbus has joined #archiveteam
14:00 🔗 Start has quit IRC (Disconnected.)
14:02 🔗 mistym has quit IRC (Read error: Operation timed out)
14:11 🔗 Ravenloft has quit IRC (Read error: Connection reset by peer)
14:17 🔗 sankin has joined #archiveteam
14:30 🔗 dashcloud has quit IRC (Read error: Operation timed out)
14:31 🔗 garyrh has quit IRC (Ping timeout: 619 seconds)
14:33 🔗 dashcloud has joined #archiveteam
14:34 🔗 useretail has quit IRC (Ping timeout: 619 seconds)
14:37 🔗 Start has joined #archiveteam
14:38 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
14:41 🔗 yipdw has quit IRC (Ping timeout: 255 seconds)
14:41 🔗 swebb has quit IRC (Ping timeout: 255 seconds)
14:41 🔗 midas has quit IRC (Ping timeout: 255 seconds)
14:41 🔗 swebb has joined #archiveteam
14:43 🔗 yipdw has joined #archiveteam
14:44 🔗 dashcloud has quit IRC (Ping timeout: 483 seconds)
14:46 🔗 midas has joined #archiveteam
14:52 🔗 dashcloud has joined #archiveteam
14:55 🔗 Start has quit IRC (Disconnected.)
14:57 🔗 useretail has joined #archiveteam
14:58 🔗 Start has joined #archiveteam
14:59 🔗 mistym has joined #archiveteam
15:14 🔗 primus104 has joined #archiveteam
15:19 🔗 primus105 has joined #archiveteam
15:22 🔗 primus105 has quit IRC (Client Quit)
15:25 🔗 DFJustin someone should probably grab this stuff, archivebot isn't working on the site http://www.dni.gov/index.php/resources/bin-laden-bookshelf
15:27 🔗 primus104 has quit IRC (Read error: Operation timed out)
15:36 🔗 Emcy has quit IRC (Read error: Connection reset by peer)
15:38 🔗 Jonimus has quit IRC (Ping timeout: 370 seconds)
15:51 🔗 Start has quit IRC (Disconnected.)
15:52 🔗 SmileyG has joined #archiveteam
15:53 🔗 tephra_ has joined #archiveteam
15:53 🔗 Quile_ has joined #archiveteam
15:56 🔗 mistym has quit IRC (Remote host closed the connection)
15:56 🔗 thechip_ has joined #archiveteam
16:01 🔗 tephra has quit IRC (hub.se irc.underworld.no)
16:01 🔗 Smiley has quit IRC (hub.se irc.underworld.no)
16:01 🔗 tsp_ has quit IRC (hub.se irc.underworld.no)
16:01 🔗 thechip has quit IRC (hub.se irc.underworld.no)
16:01 🔗 wm_ has quit IRC (hub.se irc.underworld.no)
16:01 🔗 dugo has quit IRC (hub.se irc.underworld.no)
16:01 🔗 Marc has quit IRC (hub.se irc.underworld.no)
16:01 🔗 raylee has quit IRC (hub.se irc.underworld.no)
16:01 🔗 Quile has quit IRC (hub.se irc.underworld.no)
16:01 🔗 Atluxity has quit IRC (hub.se irc.underworld.no)
16:10 🔗 dugo_ has joined #archiveteam
16:10 🔗 mistym has joined #archiveteam
16:24 🔗 philpem has quit IRC (Remote host closed the connection)
16:26 🔗 Emcy has joined #archiveteam
16:29 🔗 Ravenloft has joined #archiveteam
16:31 🔗 SimpBrain has joined #archiveteam
16:31 🔗 twrist has joined #archiveteam
16:35 🔗 garyrh has joined #archiveteam
16:35 🔗 primus104 has joined #archiveteam
16:41 🔗 aaaaaaaaa has joined #archiveteam
16:52 🔗 Start has joined #archiveteam
17:42 🔗 Start has quit IRC (Disconnected.)
17:49 🔗 aNthraXx_ has joined #archiveteam
17:52 🔗 aNthraXx has quit IRC (Read error: Operation timed out)
17:53 🔗 cadbury_ has quit IRC (Ping timeout: 606 seconds)
17:53 🔗 brayden_ has quit IRC (Ping timeout: 606 seconds)
17:56 🔗 caber has quit IRC (Ping timeout: 606 seconds)
17:59 🔗 aNthraXx_ has quit IRC (Read error: Operation timed out)
17:59 🔗 caber has joined #archiveteam
18:00 🔗 aNthraXx has joined #archiveteam
18:04 🔗 cadbury_ has joined #archiveteam
18:10 🔗 dashcloud has quit IRC (Read error: Operation timed out)
18:13 🔗 habi has joined #archiveteam
18:14 🔗 dashcloud has joined #archiveteam
18:17 🔗 raylee has joined #archiveteam
18:17 🔗 wm_ has joined #archiveteam
18:22 🔗 Emcy_ has joined #archiveteam
18:29 🔗 Emcy has quit IRC (Ping timeout: 512 seconds)
18:33 🔗 hlndr has joined #archiveteam
18:33 🔗 twrist has quit IRC (And now, for my next magic trick..)
18:37 🔗 primus104 has quit IRC (Leaving.)
18:53 🔗 sankin has quit IRC (Leaving.)
19:00 🔗 Emcy_ has quit IRC (Read error: Connection reset by peer)
19:05 🔗 mistym has quit IRC (Remote host closed the connection)
19:10 🔗 primus104 has joined #archiveteam
19:20 🔗 mistym has joined #archiveteam
19:43 🔗 username1 https://github.com/venomous0x/WhatsAPI
19:54 🔗 yipdw I have a copy
19:54 🔗 yipdw https://github.com/yipdw/WhatsAPI/commits/master
19:55 🔗 yipdw as do 2,467 others
19:55 🔗 yipdw er sorry 1,921 others
19:55 🔗 yipdw that said, fuck WhatsApp
19:58 🔗 mistym has quit IRC (Remote host closed the connection)
20:08 🔗 db48x cloned
20:11 🔗 balrog yipdw: is that up to date?
20:13 🔗 mistym has joined #archiveteam
20:15 🔗 yipdw balrog: probably not
20:15 🔗 mistym has quit IRC (Read error: Connection reset by peer)
20:15 🔗 yipdw you'll want to comb the other 1,921 clones to check
20:16 🔗 mistym has joined #archiveteam
20:16 🔗 db48x https://github.com/15786548135/WhatsAPI/commits/master
20:17 🔗 db48x https://github.com/7aduta/WhatsAPI/commits/master
20:23 🔗 habi has left
20:26 🔗 db48x Array.prototype.map.call(document.querySelectorAll("div.repo>a:nth-of-type(2)"), function (e) { return "git remote add "+ (e.href.match(/com\/([^\/]*)\//)[1]) +" "+ e.href +".git"; });
20:27 🔗 human39_ has joined #archiveteam
20:28 🔗 db48x document.documentElement.innerHTML = Array.prototype.map.call(document.querySelectorAll("div.repo>a:nth-of-type(2)"), function (e) { return "git remote add "+ (e.href.match(/com\/([^\/]*)\//)[1]) +" "+ e.href +".git"; }).join("<br>")
20:28 🔗 db48x it's only 1k of the 1.9k remotes though
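The same idea as db48x's browser snippet, sketched outside the browser: given a list of fork URLs, emit one `git remote add` command per fork, named after the fork's owner. The function name `remote_add_commands` is hypothetical, and the two fork URLs are simply ones that appear in this log; how the full list of forks is obtained is left out.

```python
import re

def remote_add_commands(repo_urls):
    """Build one 'git remote add <owner> <url>.git' command per fork URL."""
    commands = []
    for url in repo_urls:
        # The owner is the first path segment after github.com/.
        match = re.search(r"github\.com/([^/]+)/", url)
        if match is None:
            continue  # skip anything that does not look like a repo URL
        commands.append(f"git remote add {match.group(1)} {url}.git")
    return commands

# Fork URLs taken from the conversation above.
forks = [
    "https://github.com/yipdw/WhatsAPI",
    "https://github.com/7aduta/WhatsAPI",
]

for command in remote_add_commands(forks):
    print(command)
```

After adding the remotes, something like `git fetch --all` would pull every fork into one local repository.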
20:29 🔗 Start has joined #archiveteam
20:32 🔗 mistym has quit IRC (Remote host closed the connection)
20:48 🔗 mistym has joined #archiveteam
20:50 🔗 mistym has quit IRC (Remote host closed the connection)
20:53 🔗 kyan has joined #archiveteam
20:59 🔗 db48x my net connection is acting up...
21:03 🔗 mistym has joined #archiveteam
21:04 🔗 Rickster has quit IRC (Quit: ZNC - http://znc.in)
21:08 🔗 mistym has quit IRC (Remote host closed the connection)
21:09 🔗 Rickster has joined #archiveteam
21:14 🔗 mistym has joined #archiveteam
21:16 🔗 Start has quit IRC (Disconnected.)
21:33 🔗 mistym has quit IRC (Remote host closed the connection)
21:35 🔗 BlueMaxim has joined #archiveteam
21:44 🔗 mistym has joined #archiveteam
22:01 🔗 db48x interesting, the api only gives me 1822
22:03 🔗 mistym has quit IRC (Remote host closed the connection)
22:05 🔗 mistym has joined #archiveteam
22:08 🔗 n00b169 has joined #archiveteam
22:10 🔗 n00b169 has quit IRC (Client Quit)
22:13 🔗 yuvadm has joined #archiveteam
22:14 🔗 yuvadm looking for some advice on frameworks i can use to scrape the hell out of a blogging platform that's going down
22:14 🔗 yuvadm before i start NIH'ing some code
22:21 🔗 toad1 has joined #archiveteam
22:22 🔗 xmc knights of nih
22:22 🔗 toad2 has quit IRC (Read error: Operation timed out)
22:29 🔗 rumbles has joined #archiveteam
22:29 🔗 yuvadm XD
22:29 🔗 rumbles @yipdw does archivebot support parsing a json payload of urls for processing?
22:29 🔗 rumbles Url in question: https://api.github.com/repos/venomous0x/WhatsAPI/pulls?state=open
22:29 🔗 Emcy has joined #archiveteam
22:30 🔗 yuvadm heh, nice
22:33 🔗 db48x yuvadm: wpull
22:34 🔗 db48x rumbles: every pull request has an associated git ref
22:34 🔗 yuvadm db48x: bless you, exactly what i need
22:34 🔗 yuvadm py<3
22:34 🔗 nertzy has joined #archiveteam
22:35 🔗 db48x you're welcome
22:35 🔗 DFJustin yuvadm: what's the blogging platform, maybe it should be an archive team project
22:35 🔗 rumbles @db48x: thanks!
22:36 🔗 yuvadm DFJustin: i'd love for that to happen, but there's an i18n barrier, it's all in hebrew
22:36 🔗 yuvadm israblog.co.il
22:36 🔗 db48x you're welcome
22:36 🔗 yuvadm largest israeli blogging platform since way back
22:37 🔗 db48x yuvadm: sounds like a good candidate for the warrior then
22:37 🔗 db48x http://tracker.archiveteam.org/
22:37 🔗 yuvadm why's warrior the better option in this case?
22:37 🔗 db48x it's distributed, see http://tracker.archiveteam.org/furaffinity/ for an example
22:38 🔗 rumbles distributed = less likely for extraction to be banned/throttled
22:38 🔗 yuvadm what's the input for warrior? a WARC?
22:38 🔗 db48x a list of the tasks to do
22:39 🔗 db48x profile:kaafan33 and submission:15872951-15873000, for example
22:39 🔗 yuvadm cool
22:39 🔗 yuvadm i'll take a look, see if it fits the bill. who authorizes tasks for the warrior?
22:40 🔗 yuvadm this is a pretty large project
22:40 🔗 db48x each site we're working on has a separate git repository where the source for the pipeline is kept
22:40 🔗 Start has joined #archiveteam
22:40 🔗 db48x https://github.com/ArchiveTeam/furaffinity-grab
22:41 🔗 db48x looking at this one I see that it actually uses wpull
22:41 🔗 db48x https://github.com/ArchiveTeam/furaffinity-grab/blob/master2/pipeline.py#L193
22:41 🔗 yuvadm db48x: that's awesome. gotta go afk, but i'll be back with more Q's for sure
22:42 🔗 rumbles db48x would you accept a PR for a Dockerfile to build pipelines if I built one?
22:42 🔗 db48x you can see how it looks at the job ID to decide what to do; https://github.com/ArchiveTeam/furaffinity-grab/blob/master2/pipeline.py#L226
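A rough sketch of how a pipeline might branch on the item name, using the two item formats db48x quoted (`profile:kaafan33` and `submission:15872951-15873000`). This is not the code from furaffinity-grab; the function `parse_item` and its return shape are assumptions for illustration only.

```python
def parse_item(item_name):
    """Split 'type:value' and expand numeric ranges for submission items."""
    item_type, _, item_value = item_name.partition(":")
    if item_type == "submission":
        # A submission item covers an inclusive ID range, e.g. 15872951-15873000.
        start, _, end = item_value.partition("-")
        end = end or start  # allow a single ID with no range
        ids = [str(n) for n in range(int(start), int(end) + 1)]
        return item_type, ids
    # Any other type (e.g. profile) is treated as a single named target.
    return item_type, [item_value]

profile_type, profile_targets = parse_item("profile:kaafan33")
submission_type, submission_ids = parse_item("submission:15872951-15873000")

print(profile_type, profile_targets)
print(submission_type, len(submission_ids))
```

Each resulting target would then be turned into the URLs the grabber fetches; that mapping is site-specific and is not shown here.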
22:42 🔗 db48x yuvadm: sure, I'll be in and out as well
22:42 🔗 db48x rumbles: possibly
22:44 🔗 rumbles thanks!
22:58 🔗 rumbles has quit IRC (Quit: Page closed)
23:03 🔗 REiN^ has quit IRC (Read error: Operation timed out)
23:04 🔗 REiN^ has joined #archiveteam
23:05 🔗 cadbury_ has quit IRC (Read error: Operation timed out)
23:06 🔗 dinomite_ has joined #archiveteam
23:07 🔗 Jonimus has joined #archiveteam
23:08 🔗 dinomite has quit IRC (Read error: Connection reset by peer)
23:08 🔗 yipdw rumbles: no, but curl -s 'https://api.github.com/repos/venomous0x/WhatsAPI/pulls?state=open' | jq -r '.[].url' > FILE works
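For anyone without jq, the filter `.[].url` amounts to "take the `url` field of every element in the top-level array". A minimal sketch, assuming the payload is a JSON array of objects that each carry a `url` key; the function name `extract_urls` and the pull numbers in the sample payload are made up for illustration.

```python
import json

def extract_urls(payload):
    """Return the 'url' field of every object in a JSON array."""
    return [item["url"] for item in json.loads(payload)]

# Hypothetical sample payload; the pull numbers are placeholders.
sample = json.dumps([
    {"url": "https://api.github.com/repos/venomous0x/WhatsAPI/pulls/1", "state": "open"},
    {"url": "https://api.github.com/repos/venomous0x/WhatsAPI/pulls/2", "state": "open"},
])

urls = extract_urls(sample)
print("\n".join(urls))
```

Writing the result one URL per line gives a plain list that can be saved to a file.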
23:11 🔗 aNthraXx has quit IRC (Read error: No route to host)
23:13 🔗 aaaaaaaaa he left, so he may ask again later
23:14 🔗 yipdw db48x: oh yeah we have a dockerfile already
23:14 🔗 yipdw heh
23:17 🔗 Sk1d has quit IRC (Ping timeout: 606 seconds)
23:18 🔗 Sk1d has joined #archiveteam
23:23 🔗 cadbury_ has joined #archiveteam
23:23 🔗 aNthraXx has joined #archiveteam
23:23 🔗 REiN^ has quit IRC (Ping timeout: 370 seconds)
23:27 🔗 lexicon has joined #archiveteam
23:35 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
23:38 🔗 REiN^ has joined #archiveteam
23:42 🔗 Sellyme has quit IRC (No Ping reply in 180 seconds.)
23:44 🔗 Sellyme has joined #archiveteam
23:54 🔗 SimpBrain has quit IRC (Ping timeout: 258 seconds)
