[00:00] *** cadbury_ has joined #archiveteam [00:17] *** mistym has quit IRC (Remote host closed the connection) [00:42] *** philpem has quit IRC (Ping timeout: 252 seconds) [01:03] *** balrog has quit IRC (Read error: Operation timed out) [01:04] *** mistym has joined #archiveteam [01:07] *** balrog has joined #archiveteam [01:07] *** swebb sets mode: +o balrog [01:13] *** boozehoun has quit IRC (Read error: Operation timed out) [01:25] *** username1 has joined #archiveteam [01:27] *** schbirid2 has quit IRC (Read error: Operation timed out) [01:29] *** zenguy_pc has joined #archiveteam [01:29] *** BlueMaxim has joined #archiveteam [01:37] *** zenguy_pc has quit IRC (Quit: Leaving) [01:39] *** primus104 has quit IRC (Leaving.) [01:40] *** zenguy_pc has joined #archiveteam [01:46] *** VADemon has quit IRC (Read error: Connection reset by peer) [01:47] Fun stuff. I found a bug in wget's mirroring logic... [01:48] It appears it doesn't look at any charset info for any HTML file. Which means if for some reason your website is using UTF-16 (... shockingly I found something that does that), it doesn't work right. [01:51] Ah, no, that's not quite it. This is a website that is putting out UTF-16 without any indication of the charset, and wget doesn't heuristic the charset. [01:52] "Fun". [01:59] Okay, then. *When* you tell it the remote charset is UTF-16 it still looks for ASCII patterns to try and pick out URLs. [02:01] Time to find how to report a bug to wget. 
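Editor's sketch of the failure described in the 01:48 to 01:59 messages: a scanner that looks for ASCII byte patterns finds nothing in a UTF-16 body, because every ASCII character is interleaved with a zero byte. This is not wget's code. The sample HTML, the regular expressions, and the function names are all assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical page content; the real site from the log is not named.
HTML = '<a href="/page2.html">next</a>'

# ASCII-oriented pattern applied to raw bytes (stand-in for a byte-level scanner).
HREF_BYTES = re.compile(rb'href="([^"]*)"')
# The same pattern applied to decoded text.
HREF_TEXT = re.compile(r'href="([^"]*)"')


def links_from_bytes(raw: bytes) -> List[bytes]:
    """Scan the undecoded body for href attributes."""
    return HREF_BYTES.findall(raw)


def links_from_text(raw: bytes, charset: str) -> List[str]:
    """Decode with the declared charset first, then scan."""
    return HREF_TEXT.findall(raw.decode(charset))


utf8_body = HTML.encode("utf-8")
utf16_body = HTML.encode("utf-16")  # BOM followed by two bytes per character

print(links_from_bytes(utf8_body))            # the byte scan works for UTF-8
print(links_from_bytes(utf16_body))           # the byte scan finds nothing
print(links_from_text(utf16_body, "utf-16"))  # decoding first recovers the link
```

Decoding before scanning is the usual remedy, but it only helps when the charset is declared or detected, which is the other half of the complaint in the log.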
[02:17] *** lytv has quit IRC (Read error: Operation timed out) [02:20] *** lytv has joined #archiveteam [02:33] *** X-Scale` has joined #archiveteam [02:37] *** X-Scale has quit IRC (Ping timeout: 506 seconds) [02:37] *** Stiletto has joined #archiveteam [03:04] *** db48x has joined #archiveteam [03:06] *** Ravenloft has joined #archiveteam [04:06] *** SN4T14__ has joined #archiveteam [04:11] *** SN4T14_ has quit IRC (Ping timeout: 306 seconds) [04:13] yeah utf-16 has never worked with wget I think, try wpull [04:15] Still a bug. :) [04:17] * closure has filed bugs on both wget and curl this year. they fixed the curl one. wget one can get it to delete a file that it's not supposed to touch.. [04:18] got curl to behave sensibly when downloading empty files, at last :) [04:18] DFJustin: From the sounds of it wpull looks rather a lot nicer for some stuff. [04:24] *** mistym has quit IRC (Remote host closed the connection) [04:41] *** aaaaaaaaa has quit IRC (Leaving) [04:44] *** mistym has joined #archiveteam [05:06] *** underscor has quit IRC (Ping timeout: 370 seconds) [05:28] *** underscor has joined #archiveteam [05:28] *** swebb sets mode: +o underscor [06:08] *** PepsiMax_ is now known as PepsiMax [06:50] *** mistym has quit IRC (Remote host closed the connection) [06:50] *** puddle has quit IRC (Quit: Connection closed for inactivity) [06:52] *** nertzy2 has quit IRC (Quit: This computer has gone to sleep) [06:56] *** hlndr has quit IRC (Read error: Operation timed out) [07:00] *** hlndr has joined #archiveteam [07:00] *** garyrh has quit IRC (http://bnc4free.com/) [07:01] *** garyrh has joined #archiveteam [07:02] *** primus104 has joined #archiveteam [07:15] *** yipdw has quit IRC (Remote host closed the connection) [07:15] *** yipdw has joined #archiveteam [07:43] *** primus104 has quit IRC (Leaving.) 
[07:53] *** atomotic has joined #archiveteam [07:55] *** hlndr has quit IRC (Read error: Operation timed out) [07:55] *** philpem has joined #archiveteam [08:13] *** MMovie1 has joined #archiveteam [08:16] *** MMovie has quit IRC (Ping timeout: 306 seconds) [08:30] *** primus104 has joined #archiveteam [08:43] *** schbirid2 has joined #archiveteam [08:45] *** username1 has quit IRC (Read error: Operation timed out) [09:03] *** X-Scale` is now known as X-Scale [09:55] *** dashcloud has quit IRC (Read error: Operation timed out) [09:58] *** dashcloud has joined #archiveteam [10:00] closure: The curl people are nice and pretty reasonable~ [10:17] *** wwwtxt has joined #archiveteam [10:32] *** wwwtxt has quit IRC (Client Quit) [10:33] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) [10:50] *** john1 has quit IRC (Read error: Operation timed out) [10:52] *** hlndr has joined #archiveteam [10:58] *** hlndr has quit IRC (Ping timeout: 306 seconds) [11:02] *** john1 has joined #archiveteam [11:18] *** dashcloud has quit IRC (Read error: Operation timed out) [11:22] *** dashcloud has joined #archiveteam [11:31] *** Ymgve has joined #archiveteam [11:45] *** atomotic has joined #archiveteam [12:21] *** primus104 has quit IRC (Leaving.) 
[12:27] *** xtr-107 has quit IRC (Read error: Connection reset by peer) [12:29] *** xtr-201 has joined #archiveteam [12:30] *** username1 has joined #archiveteam [12:32] *** schbirid2 has quit IRC (Read error: Operation timed out) [12:41] *** BlueMaxim has quit IRC (Quit: Leaving) [12:46] *** bzc6p has joined #archiveteam [12:48] *** signius has quit IRC (Read error: Operation timed out) [12:49] *** nertzy2 has joined #archiveteam [12:50] pikhq: as far as I know this is the official wget bug tracking site: http://savannah.gnu.org/bugs/?group=wget [12:50] Reading your lines, I wonder if a bug that I filed is related to this one [12:51] http://savannah.gnu.org/bugs/?42794 [12:51] The encoding is UTF-8 but we still couldn't find the logic in the bug. [12:53] Since that moment (quite early on) I've been using wpull for website archival. (I also miss the --retry-dns-error option in wget, which is crucial for me as I don't have a stable connection) [12:55] *** mistym has joined #archiveteam [12:55] *** nertzy2 has quit IRC (Read error: Operation timed out) [12:59] *** bzc6p has quit IRC (bzc6p) [13:02] *** mistym has quit IRC (Read error: Operation timed out) [13:03] *** signius has joined #archiveteam [13:36] *** Morbus has quit IRC (Quit: http://www.disobey.com/) [13:55] *** mistym has joined #archiveteam [13:59] *** Morbus has joined #archiveteam [14:00] *** Start has quit IRC (Disconnected.)
[14:02] *** mistym has quit IRC (Read error: Operation timed out) [14:11] *** Ravenloft has quit IRC (Read error: Connection reset by peer) [14:17] *** sankin has joined #archiveteam [14:30] *** dashcloud has quit IRC (Read error: Operation timed out) [14:31] *** garyrh has quit IRC (Ping timeout: 619 seconds) [14:33] *** dashcloud has joined #archiveteam [14:34] *** useretail has quit IRC (Ping timeout: 619 seconds) [14:37] *** Start has joined #archiveteam [14:38] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) [14:41] *** yipdw has quit IRC (Ping timeout: 255 seconds) [14:41] *** swebb has quit IRC (Ping timeout: 255 seconds) [14:41] *** midas has quit IRC (Ping timeout: 255 seconds) [14:41] *** swebb has joined #archiveteam [14:43] *** yipdw has joined #archiveteam [14:44] *** dashcloud has quit IRC (Ping timeout: 483 seconds) [14:46] *** midas has joined #archiveteam [14:52] *** dashcloud has joined #archiveteam [14:55] *** Start has quit IRC (Disconnected.) [14:57] *** useretail has joined #archiveteam [14:58] *** Start has joined #archiveteam [14:59] *** mistym has joined #archiveteam [15:14] *** primus104 has joined #archiveteam [15:19] *** primus105 has joined #archiveteam [15:22] *** primus105 has quit IRC (Client Quit) [15:25] someone should probably grab this stuff, archivebot isn't working on the site http://www.dni.gov/index.php/resources/bin-laden-bookshelf [15:27] *** primus104 has quit IRC (Read error: Operation timed out) [15:36] *** Emcy has quit IRC (Read error: Connection reset by peer) [15:38] *** Jonimus has quit IRC (Ping timeout: 370 seconds) [15:51] *** Start has quit IRC (Disconnected.) 
[15:52] *** SmileyG has joined #archiveteam [15:53] *** tephra_ has joined #archiveteam [15:53] *** Quile_ has joined #archiveteam [15:56] *** mistym has quit IRC (Remote host closed the connection) [15:56] *** thechip_ has joined #archiveteam [16:01] *** tephra has quit IRC (hub.se irc.underworld.no) [16:01] *** Smiley has quit IRC (hub.se irc.underworld.no) [16:01] *** tsp_ has quit IRC (hub.se irc.underworld.no) [16:01] *** thechip has quit IRC (hub.se irc.underworld.no) [16:01] *** wm_ has quit IRC (hub.se irc.underworld.no) [16:01] *** dugo has quit IRC (hub.se irc.underworld.no) [16:01] *** Marc has quit IRC (hub.se irc.underworld.no) [16:01] *** raylee has quit IRC (hub.se irc.underworld.no) [16:01] *** Quile has quit IRC (hub.se irc.underworld.no) [16:01] *** Atluxity has quit IRC (hub.se irc.underworld.no) [16:10] *** dugo_ has joined #archiveteam [16:10] *** mistym has joined #archiveteam [16:24] *** philpem has quit IRC (Remote host closed the connection) [16:26] *** Emcy has joined #archiveteam [16:29] *** Ravenloft has joined #archiveteam [16:31] *** SimpBrain has joined #archiveteam [16:31] *** twrist has joined #archiveteam [16:35] *** garyrh has joined #archiveteam [16:35] *** primus104 has joined #archiveteam [16:41] *** aaaaaaaaa has joined #archiveteam [16:52] *** Start has joined #archiveteam [17:42] *** Start has quit IRC (Disconnected.) 
[17:49] *** aNthraXx_ has joined #archiveteam [17:52] *** aNthraXx has quit IRC (Read error: Operation timed out) [17:53] *** cadbury_ has quit IRC (Ping timeout: 606 seconds) [17:53] *** brayden_ has quit IRC (Ping timeout: 606 seconds) [17:56] *** caber has quit IRC (Ping timeout: 606 seconds) [17:59] *** aNthraXx_ has quit IRC (Read error: Operation timed out) [17:59] *** caber has joined #archiveteam [18:00] *** aNthraXx has joined #archiveteam [18:04] *** cadbury_ has joined #archiveteam [18:10] *** dashcloud has quit IRC (Read error: Operation timed out) [18:13] *** habi has joined #archiveteam [18:14] *** dashcloud has joined #archiveteam [18:17] *** raylee has joined #archiveteam [18:17] *** wm_ has joined #archiveteam [18:22] *** Emcy_ has joined #archiveteam [18:29] *** Emcy has quit IRC (Ping timeout: 512 seconds) [18:33] *** hlndr has joined #archiveteam [18:33] *** twrist has quit IRC (And now, for my next magic trick..) [18:37] *** primus104 has quit IRC (Leaving.) [18:53] *** sankin has quit IRC (Leaving.) [19:00] *** Emcy_ has quit IRC (Read error: Connection reset by peer) [19:05] *** mistym has quit IRC (Remote host closed the connection) [19:10] *** primus104 has joined #archiveteam [19:20] *** mistym has joined #archiveteam [19:43] https://github.com/venomous0x/WhatsAPI [19:54] I have a copy [19:54] https://github.com/yipdw/WhatsAPI/commits/master [19:55] as do 2,467 others [19:55] er sorry 1,921 others [19:55] that said, fuck WhatsApp [19:58] *** mistym has quit IRC (Remote host closed the connection) [20:08] cloned [20:11] yipdw: is that up to date? 
[20:13] *** mistym has joined #archiveteam [20:15] balrog: probably not [20:15] *** mistym has quit IRC (Read error: Connection reset by peer) [20:15] you'll want to comb the other 1,921 clones to check [20:16] *** mistym has joined #archiveteam [20:16] https://github.com/15786548135/WhatsAPI/commits/master [20:17] https://github.com/7aduta/WhatsAPI/commits/master [20:23] *** habi has left [20:26] Array.prototype.map.call(document.querySelectorAll("div.repo>a:nth-of-type(2)"), function (e) { return "git remote add "+ (e.href.match(/com\/([^\/]*)\//)[1]) +" "+ e.href +".git"; }); [20:27] *** human39_ has joined #archiveteam [20:28] document.documentElement.innerHTML = Array.prototype.map.call(document.querySelectorAll("div.repo>a:nth-of-type(2)"), function (e) { return "git remote add "+ (e.href.match(/com\/([^\/]*)\//)[1]) +" "+ e.href +".git"; }).join("<br>")
[20:28] it's only 1k of the 1.9k remotes though [20:29] *** Start has joined #archiveteam [20:32] *** mistym has quit IRC (Remote host closed the connection) [20:48] *** mistym has joined #archiveteam [20:50] *** mistym has quit IRC (Remote host closed the connection) [20:53] *** kyan has joined #archiveteam [20:59] my net connection is acting up... [21:03] *** mistym has joined #archiveteam [21:04] *** Rickster has quit IRC (Quit: ZNC - http://znc.in) [21:08] *** mistym has quit IRC (Remote host closed the connection) [21:09] *** Rickster has joined #archiveteam [21:14] *** mistym has joined #archiveteam [21:16] *** Start has quit IRC (Disconnected.) [21:33] *** mistym has quit IRC (Remote host closed the connection) [21:35] *** BlueMaxim has joined #archiveteam [21:44] *** mistym has joined #archiveteam [22:01] interesting, the api only gives me 1822 [22:03] *** mistym has quit IRC (Remote host closed the connection) [22:05] *** mistym has joined #archiveteam [22:08] *** n00b169 has joined #archiveteam [22:10] *** n00b169 has quit IRC (Client Quit) [22:13] *** yuvadm has joined #archiveteam [22:14] looking for some advice on frameworks i can use to scrape the hell out of a blogging platform that's going down [22:14] before i start NIH'ing some code [22:21] *** toad1 has joined #archiveteam [22:22] knights of nih [22:22] *** toad2 has quit IRC (Read error: Operation timed out) [22:29] *** rumbles has joined #archiveteam [22:29] XD [22:29] @yipdw does archivebot support parsing a json payload of urls for processing?
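Editor's sketch of what the 20:26 and 20:28 browser one-liners are doing: turning each fork link into a command that registers the fork as a git remote named after its owner. The function name is an assumption, and the input list only reuses fork URLs that appear in the log. Note that the git subcommand order is `git remote add NAME URL`.

```python
from typing import List
from urllib.parse import urlparse


def remote_add_command(repo_url: str) -> str:
    """Build a `git remote add` command from a GitHub repository URL.

    The owner name is used as the remote name. Anything after
    /owner/repo in the path (for example /commits/master) is ignored.
    """
    path = urlparse(repo_url).path.strip("/")
    owner, repo = path.split("/")[:2]
    return f"git remote add {owner} https://github.com/{owner}/{repo}.git"


# Fork URLs taken from the log; a full run would need the complete fork list.
FORKS: List[str] = [
    "https://github.com/yipdw/WhatsAPI/commits/master",
    "https://github.com/15786548135/WhatsAPI/commits/master",
    "https://github.com/7aduta/WhatsAPI/commits/master",
]

for url in FORKS:
    print(remote_add_command(url))
```

After adding the remotes inside one clone, `git fetch --all` would pull every fork's branches into that single repository.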
[22:29] Url in question: https://api.github.com/repos/venomous0x/WhatsAPI/pulls?state=open [22:29] *** Emcy has joined #archiveteam [22:30] heh, nice [22:33] yuvadm: wpull [22:34] rumbles: every pull request has an associated git ref [22:34] db48x: bless you, exactly what i need [22:34] py<3 [22:34] *** nertzy has joined #archiveteam [22:35] you're welcome [22:35] yuvadm: what's the blogging platform, maybe it should be an archive team project [22:35] @db48x: thanks! [22:36] DFJustin: i'd love for that to happen, but there's an i18n barrier, it's all in hebrew [22:36] israblog.co.il [22:36] you're welcome [22:36] largest israeli blogging platform since way back [22:37] yuvadm: sounds like a good candidate for the warrior then [22:37] http://tracker.archiveteam.org/ [22:37] why's warrior the better option in this case? [22:37] it's distributed, see http://tracker.archiveteam.org/furaffinity/ for an example [22:38] distributed = less likely for extraction to be banned/throttled [22:38] what's the input for warrior? a WARC? [22:38] a list of the tasks to do [22:39] profile:kaafan33 and submission:15872951-15873000, for example [22:39] cool [22:39] i'll take a look, see if it fits the bill. who authorizes tasks for the warrior? [22:40] this is a pretty large project [22:40] each site we're working on has a separate git repository where the source for the pipeline is kept [22:40] *** Start has joined #archiveteam [22:40] https://github.com/ArchiveTeam/furaffinity-grab [22:41] looking at this one I see that it actually uses wpull [22:41] https://github.com/ArchiveTeam/furaffinity-grab/blob/master2/pipeline.py#L193 [22:41] db48x: that's awesome. gotta go afk, but i'll be back with more Q's for sure [22:42] db48x would you accept a PR for a Dockerfile to build pipelines if I built one?
[22:42] you can see how it looks at the job ID to decide what to do; https://github.com/ArchiveTeam/furaffinity-grab/blob/master2/pipeline.py#L226 [22:42] yuvadm: sure, I'll be in and out as well [22:42] rumbles: possibly [22:44] thanks! [22:58] *** rumbles has quit IRC (Quit: Page closed) [23:03] *** REiN^ has quit IRC (Read error: Operation timed out) [23:04] *** REiN^ has joined #archiveteam [23:05] *** cadbury_ has quit IRC (Read error: Operation timed out) [23:06] *** dinomite_ has joined #archiveteam [23:07] *** Jonimus has joined #archiveteam [23:08] *** dinomite has quit IRC (Read error: Connection reset by peer) [23:08] rumbles: no, but curl 'https://api.github.com/repos/venomous0x/WhatsAPI/pulls?state=open' | jq '.[].url' > FILE works [23:11] *** aNthraXx has quit IRC (Read error: No route to host) [23:13] he left, so he may ask again later [23:14] db48x: oh yeah we have a dockerfile already [23:14] heh [23:17] *** Sk1d has quit IRC (Ping timeout: 606 seconds) [23:18] *** Sk1d has joined #archiveteam [23:23] *** cadbury_ has joined #archiveteam [23:23] *** aNthraXx has joined #archiveteam [23:23] *** REiN^ has quit IRC (Ping timeout: 370 seconds) [23:27] *** lexicon has joined #archiveteam [23:35] *** nertzy has quit IRC (Quit: This computer has gone to sleep) [23:38] *** REiN^ has joined #archiveteam [23:42] *** Sellyme has quit IRC (No Ping reply in 180 seconds.) [23:44] *** Sellyme has joined #archiveteam [23:54] *** SimpBrain has quit IRC (Ping timeout: 258 seconds)
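Editor's sketch of the 23:08 suggestion: take the JSON array returned for the pull request listing and keep only each element's "url" field, which is what the jq filter '.[].url' selects. The sample payload is hypothetical (the pull numbers are made up) and the function name is an assumption. The sketch parses a string rather than fetching anything, so it runs offline.

```python
import json
from typing import List

# Hypothetical sample shaped like a pull request listing: a JSON array of
# objects that each carry a "url" field. The pull numbers are invented.
SAMPLE = """
[
  {"number": 1, "url": "https://api.github.com/repos/venomous0x/WhatsAPI/pulls/1"},
  {"number": 2, "url": "https://api.github.com/repos/venomous0x/WhatsAPI/pulls/2"}
]
"""


def pull_urls(payload: str) -> List[str]:
    """Return the "url" field of every array element, like jq '.[].url'."""
    return [item["url"] for item in json.loads(payload)]


# One URL per line, the shape a URL-list file for a crawler usually takes.
for url in pull_urls(SAMPLE):
    print(url)
```

One difference from the shell version: jq prints each string JSON-quoted unless it is run with -r, while this sketch prints bare URLs.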