[00:08] *** Sk1d has quit IRC (Read error: Operation timed out) [00:11] *** Sk1d has joined #archiveteam-bs [00:24] *** Sk1d has quit IRC (Read error: Operation timed out) [00:27] *** Sk1d has joined #archiveteam-bs [00:40] *** Sk1d has quit IRC (Read error: Operation timed out) [00:45] *** Sk1d has joined #archiveteam-bs [00:46] Its not in my lists [00:57] *** Sk1d has quit IRC (Read error: Operation timed out) [01:02] *** Sk1d has joined #archiveteam-bs [01:13] *** Sk1d has quit IRC (Read error: Operation timed out) [01:18] *** Sk1d has joined #archiveteam-bs [01:23] *** VerifiedJ has quit IRC (Quit: Leaving) [01:30] *** Sk1d has quit IRC (Read error: Operation timed out) [01:35] *** Sk1d has joined #archiveteam-bs [01:39] Doomtay [01:51] *** Sk1d has quit IRC (Read error: Operation timed out) [01:51] *** balrog has quit IRC (Quit: Bye) [01:56] *** Sk1d has joined #archiveteam-bs [01:56] *** balrog has joined #archiveteam-bs [01:56] *** swebb sets mode: +o balrog [02:07] *** Sk1d has quit IRC (Read error: Operation timed out) [02:11] *** Sk1d has joined #archiveteam-bs [03:16] *** Sk1d has quit IRC (Read error: Operation timed out) [03:19] *** Sk1d has joined #archiveteam-bs [03:35] *** Sk1d has quit IRC (Read error: Operation timed out) [03:37] *** Sk1d has joined #archiveteam-bs [03:51] *** K4k has quit IRC (Read error: Connection reset by peer) [03:58] *** m007a83_ is now known as m007a83 [04:05] *** Sk1d has quit IRC (Read error: Operation timed out) [04:10] *** Sk1d has joined #archiveteam-bs [04:42] *** qw3rty116 has joined #archiveteam-bs [04:43] *** Sk1d has quit IRC (Read error: Operation timed out) [04:46] *** Sk1d has joined #archiveteam-bs [04:47] *** qw3rty115 has quit IRC (Read error: Operation timed out) [04:55] *** odemg has quit IRC (Ping timeout: 246 seconds) [04:59] *** Sk1d has quit IRC (Read error: Operation timed out) [05:03] *** Sk1d has joined #archiveteam-bs [05:05] *** zerkalo_ has quit IRC (Read error: Operation timed out) [05:07] *** 
zerkalo has joined #archiveteam-bs [05:09] *** jspiros has quit IRC (Read error: Operation timed out) [05:09] *** sknebel_ has quit IRC (Read error: Operation timed out) [05:10] *** Petri152 has quit IRC (Ping timeout: 246 seconds) [05:11] *** joepie91 has quit IRC (Ping timeout: 246 seconds) [05:11] *** odemg has joined #archiveteam-bs [05:11] *** sknebel has joined #archiveteam-bs [05:11] *** Coderjo has quit IRC (Ping timeout: 246 seconds) [05:12] *** JAA has quit IRC (Ping timeout: 246 seconds) [05:12] *** zyphlar has quit IRC (Ping timeout: 246 seconds) [05:12] *** svchfoo1 has quit IRC (Ping timeout: 246 seconds) [05:13] *** Coderjo has joined #archiveteam-bs [05:15] *** c4rc4s has quit IRC (Read error: Operation timed out) [05:15] *** Mayonaise has quit IRC (Ping timeout: 492 seconds) [05:18] *** Muad-Dib has quit IRC (Ping timeout: 260 seconds) [05:19] *** joepie91 has joined #archiveteam-bs [05:20] *** Sk1d has quit IRC (Read error: Operation timed out) [05:22] *** Sk1d has joined #archiveteam-bs [05:23] *** godane has quit IRC (Ping timeout: 252 seconds) [05:25] *** Mayonaise has joined #archiveteam-bs [05:29] *** Muad-Dib has joined #archiveteam-bs [05:39] *** kidneybea has joined #archiveteam-bs [05:53] *** kyounko has joined #archiveteam-bs [06:07] *** godane has joined #archiveteam-bs [06:10] *** svchfoo1 has joined #archiveteam-bs [06:10] *** c4rc4s has joined #archiveteam-bs [06:10] *** Petri152 has joined #archiveteam-bs [06:10] *** zyphlar has joined #archiveteam-bs [06:10] In regards to attempting to archive Lenny Letter - https://www.lennyletter.com/ - this one has a infinite scrolling feature, but you have to click on "Load More Stories" to get more of the articles [06:10] *** svchfoo3 sets mode: +o svchfoo1 [06:11] *** JAA has joined #archiveteam-bs [06:11] *** swebb sets mode: +o JAA [06:11] *** bakJAA sets mode: +o JAA [06:13] *** jspiros has joined #archiveteam-bs [06:24] *** Sk1d has quit IRC (Read error: Operation timed out) [06:25] *** 
zerkalo has quit IRC (Remote host closed the connection) [06:28] *** Sk1d has joined #archiveteam-bs [06:29] Ryz: Seems like it has a detailed sitemap.xml: https://www.lennyletter.com/sitemap.xml [06:29] Which points to other XML files which seem to point to stories: https://www.lennyletter.com/sitemap.xml?year=2015&month=10&week=2 [06:30] is this a job for archivebot? [06:30] Sadly, I have no idea how to extract the links in a way that's not cumbersome [06:30] *** Atom-- has joined #archiveteam-bs [06:30] I'm not sure if ArchiveBot can handle it if it has infinite scrolling [06:31] I gave it a try [06:31] Well it doesn't really if you can get the links from the XML files [06:31] Not sure if archiveboot will pick those up, but some crawlers probably will. [06:31] s/boot/bot [06:33] *** Atom__ has quit IRC (Ping timeout: 252 seconds) [06:34] *** Atom__ has joined #archiveteam-bs [06:36] *** Atom-- has quit IRC (Ping timeout: 252 seconds) [06:40] *** Sk1d has quit IRC (Read error: Operation timed out) [06:45] *** Sk1d has joined #archiveteam-bs [06:57] Ryz, Flashfire: Here's a list of links to posts extracted from the sitemaps, if it helps: https://transfer.sh/10HhxI/lenny-letter-list.txt [06:57] (Mostly I wanted a chance to play with shell commands.) [07:08] Awesome, thanks jodizzle - let's hope that's all of the articles [07:08] The lack of a timestamp on the URLs themselves makes this tricky [07:10] *** kidneybea has quit IRC (Quit: Page closed) [07:29] *** Sk1d has quit IRC (Read error: Operation timed out) [07:31] *** Sk1d has joined #archiveteam-bs [07:44] *** Sk1d has quit IRC (Read error: Operation timed out) [07:48] *** Sk1d has joined #archiveteam-bs [07:52] Ryz: The sitemap files are timestampped with year, month and week, so we can at least know that much. [07:53] If you need it for some reason. 
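A rough sketch of the sitemap approach discussed above, assuming the top-level sitemap.xml lists child sitemaps in `<loc>` elements and that each child sitemap lists story URLs the same way (the chat only says they "seem to"). This is not the shell pipeline jodizzle actually used; the function names and the injectable `fetcher` parameter are made up for illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET


def extract_locs(xml_data):
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_data)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def http_fetch(url):
    """Download a URL and return the raw response bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def collect_post_urls(index_url, fetcher=http_fetch):
    """Follow each child sitemap listed in the index and gather story URLs.

    Assumption: the index is exactly one level deep (index -> weekly files).
    """
    posts = []
    for child_url in extract_locs(fetcher(index_url)):
        posts.extend(extract_locs(fetcher(child_url)))
    # Deduplicate and sort so the list can be diffed between runs.
    return sorted(set(posts))
```

Calling `collect_post_urls("https://www.lennyletter.com/sitemap.xml")` and writing one URL per line would give a list that can be fed to a crawler. Passing a different `fetcher` makes it possible to try the logic on saved copies of the XML first.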
[08:01] *** Sk1d has quit IRC (Read error: Operation timed out) [08:04] *** Sk1d has joined #archiveteam-bs [08:08] Ryz, Flashfire: Here's Lenny Letter social media after snscraping, if you wanna throw this in archivebot: https://transfer.sh/169FA3/lenny-letter-facebook.txt https://transfer.sh/4THNf/lenny-letter-twitter.txt https://transfer.sh/wuEhd/lenny-letter-instagram.txt [08:09] And here's the youtube if anyone wants to get it: https://www.youtube.com/channel/UCDfky0ey-Gb6SqoF3wRhHWQ [08:16] *** Sk1d has quit IRC (Read error: Operation timed out) [08:19] *** godane has quit IRC (Ping timeout: 268 seconds) [08:19] *** Sk1d has joined #archiveteam-bs [08:23] Now what to do with USA Today's Overwatch Wire website - https://overwatchwire.usatoday.com/ [08:26] I can snscrape the social media links. [08:33] *** Sk1d has quit IRC (Read error: Operation timed out) [08:35] Ryz: https://transfer.sh/QK5ZD/overwatchwire-twitter.txt https://transfer.sh/jTjhU/overwatchwire-facebook.txt [08:36] *** Sk1d has joined #archiveteam-bs [08:40] !status bc7mt8vl9cxi1t16gq99o8imr [08:41] Whoops :p [08:48] *** Sk1d has quit IRC (Read error: Operation timed out) [08:53] *** Sk1d has joined #archiveteam-bs [09:01] *** Petri152 has quit IRC (Read error: Operation timed out) [09:01] *** zyphlar has quit IRC (Read error: Operation timed out) [09:01] *** JAA has quit IRC (Read error: Operation timed out) [09:02] *** svchfoo1 has quit IRC (Read error: Operation timed out) [09:02] *** jspiros has quit IRC (Read error: Operation timed out) [09:03] *** c4rc4s has quit IRC (Read error: Operation timed out) [09:12] *** Sk1d has quit IRC (Read error: Operation timed out) [09:13] *** achip has quit IRC (Read error: Operation timed out) [09:14] *** Sk1d has joined #archiveteam-bs [09:16] And thanks for the .txt files jodizzle once more [09:17] *** achip has joined #archiveteam-bs [10:00] *** jspiros has joined #archiveteam-bs [10:00] *** zyphlar has joined #archiveteam-bs [10:00] *** svchfoo1 has 
joined #archiveteam-bs [10:01] *** c4rc4s has joined #archiveteam-bs [10:01] *** Petri152 has joined #archiveteam-bs [10:01] *** svchfoo3 sets mode: +o svchfoo1 [10:02] *** JAA has joined #archiveteam-bs [10:02] *** swebb sets mode: +o JAA [10:02] *** bakJAA sets mode: +o JAA [10:05] Hmm, unsure if such an article has been shared before here, but could take in an interest in archiving Chinese websites: https://qz.com/1166024/china-shut-down-over-13000-illegal-websites-and-10-million-user-accounts-since-2015/ [10:06] *** BlueMax has quit IRC (Read error: Connection reset by peer) [10:16] Not sure if The Debrief website is still accessible or not - https://thedebrief.co.uk/ - the website unfortunately coughs out the explicit message of "Your connection is not private" [10:19] A bit of an eye-opener that The Debrief doesn't have a Wikipedia article yet [10:53] jodizzle, Ryz: ArchiveBot does parse sitemaps. It won't do the infinite scrolling though. [11:30] *** Sk1d has quit IRC (Read error: Operation timed out) [11:34] *** Sk1d has joined #archiveteam-bs [11:49] *** Sk1d has quit IRC (Read error: Operation timed out) [11:53] *** Sk1d has joined #archiveteam-bs [12:08] *** Sk1d has quit IRC (Read error: Operation timed out) [12:08] *** zerkalo has joined #archiveteam-bs [12:11] *** Sk1d has joined #archiveteam-bs [12:22] *** godane has joined #archiveteam-bs [12:23] *** godane has quit IRC (Client Quit) [12:29] *** Sk1d has quit IRC (Read error: Operation timed out) [12:32] *** Sk1d has joined #archiveteam-bs [12:36] *** godane has joined #archiveteam-bs [12:42] *** Ryz has quit IRC (Quit: ChatZilla 0.9.92-rdmsoft [XULRunner 35.0.1/20150122214805]) [12:45] *** Sk1d has quit IRC (Read error: Operation timed out) [12:45] *** twoTBHetz has quit IRC (Remote host closed the connection) [12:50] *** Sk1d has joined #archiveteam-bs [13:02] *** Sk1d has quit IRC (Read error: Operation timed out) [13:04] *** Sk1d has joined #archiveteam-bs [13:17] *** Sk1d has quit IRC (Read 
error: Operation timed out) [13:17] *** Pixi` has quit IRC (Quit: Pixi`) [13:20] *** Sk1d has joined #archiveteam-bs [13:22] *** twoTBHetz has joined #archiveteam-bs [13:33] Ryz if you read logs: thedebrief.co.uk is not accessible, it redirects to another website and has a bad SSL config that spews errors in browser [13:34] *** Pixi has joined #archiveteam-bs [13:35] *** Sk1d has quit IRC (Read error: Operation timed out) [13:40] *** Sk1d has joined #archiveteam-bs [13:53] *** VerifiedJ has joined #archiveteam-bs [14:05] *** Mayonaise has quit IRC (Read error: Operation timed out) [14:06] *** Sk1d has quit IRC (Read error: Operation timed out) [14:08] *** Sk1d has joined #archiveteam-bs [14:12] *** wp494 has quit IRC (Ping timeout: 260 seconds) [14:13] *** wp494 has joined #archiveteam-bs [14:22] *** Sk1d has quit IRC (Read error: Operation timed out) [14:27] *** Sk1d has joined #archiveteam-bs [14:35] *** Mayonaise has joined #archiveteam-bs [14:41] *** Sk1d has quit IRC (Read error: Operation timed out) [14:43] *** Sk1d has joined #archiveteam-bs [15:02] *** VerifiedJ has quit IRC (Ping timeout: 252 seconds) [15:11] *** Hiccup has joined #archiveteam-bs [15:12] *** VerifiedJ has joined #archiveteam-bs [15:13] Anyone have any idea what might cause this pywb issue?: https://github.com/webrecorder/pywb/issues/406 [15:13] Or is pywb not a recommended "complicated website" recorder? [15:13] Also, does pywb work if you have to use special user agents and client certificates to access the website? [15:32] *** VerifiedJ has quit IRC (Ping timeout: 252 seconds) [15:36] Hiccup: UAs are not a problem, but client certificates will likely be one. [15:37] Any idea how I can overcome the problem? [15:37] (btw the issue with pywb I posted above is probably unrelated as I tried it on normal websites too) [15:42] Hiccup: I think you said before that you need a proper browser as well, right? In that case, it'll probably be very hard. 
[15:42] well it doesn't need to be a proper browser [15:42] it can be a very basic browser [15:43] *** VerifiedJ has joined #archiveteam-bs [15:43] might not even need JS [15:43] Oh [15:43] In that case, try wpull. [15:43] You can specify a client certificate with --certificate and the related options. [15:44] So that's something that will archive urls you feed it or work recursively? [15:45] That would be okay, except the website I want to archive doesn't actually use proper URLs [15:45] Can do both. [15:45] I'd need it to be recursive, but it would essentially need to grep the html for urls that are in custom tags [15:45] (this website is intended to be used with a custom browser) [15:46] In that case, you might have to use the get_urls hook to extract those additional URLs. [16:14] Actually [16:14] just recursive searching won't be enough. [16:14] There will need to be a manual element to it... [16:14] Are there any tools similar to pywb, that support client certs? [16:29] there's warcprox, but I'm not sure if it works with custom client certs [16:29] *** moufu_ is now known as moufu [16:33] Yeah, I don't think it does since it opens its own SSL connection. [16:33] Same with all the other MITM proxies. [16:38] I may actually be able to use wget. I can just browse the website normally, noting URLs down (and their likely variations down), then run them all through wget. Then I can grep those files for any extra URLs. [16:39] *** VerifiedJ has quit IRC (Ping timeout: 252 seconds) [16:39] I don't think there's any need to use WARC. I've setup wget to record all the HTTP Headers to a file, so having that + the files is basically going to be everything useful, I think. 
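The "grep those files for any extra URLs" step above could look roughly like this. It is only a sketch: the real site's custom tag and attribute names are not given in the chat, so it searches the raw text for anything that looks like an absolute http(s) URL instead of parsing specific tags. The `x-link` tag and `target` attribute in the usage note are invented.

```python
import html
import re

# Matches absolute http(s) URLs up to the next whitespace, quote or angle bracket.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")


def extract_urls(page_text):
    """Return unique URLs found anywhere in the text, in first-seen order."""
    seen = set()
    urls = []
    for match in URL_RE.findall(page_text):
        url = html.unescape(match)  # turn &amp; back into & inside query strings
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

For example, `extract_urls('<x-link target="https://example.invalid/a">A</x-link>')` picks up the URL even though `x-link` is not a standard HTML tag. Relative links would be missed by this pattern and would need separate handling.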
[16:40] *** Sk1d has quit IRC (Read error: Operation timed out) [16:45] *** Sk1d has joined #archiveteam-bs [16:50] WARC is the only way to get accepted into Wayback afaik everything else is disregarded as not legit Hiccup [16:50] yeah [16:51] warcs are the feedstock of wayback, but they have to be blessed because IA doesn't want to publish history that is false [16:51] that adds to it ^ [16:51] I'm not sure if this could go into wayback anyway, because of the useragent+clientcerts [16:54] wget can output warcs too [16:58] *** Sk1d has quit IRC (Read error: Operation timed out) [17:02] *** Sk1d has joined #archiveteam-bs [17:11] true [17:11] but in an old format I think [17:11] I'm not sure how IA could check if WARCs are genuine [17:11] WARCs from any random person like me [17:18] there isn't a technical solution to this social problem [17:22] you are probably right about that [17:22] its just a trust thing [17:22] websites aren't "self-verifying" or anything [17:24] do we want a warrior job for hostinger? [17:24] just got the ping from betamax [17:25] *** Hiccup has quit IRC (Remote host closed the connection) [17:25] we can look into that when we have a good list [17:26] i am currently working on a larger list for hostinger what we have so far is fine though [17:26] what is the problem with my current list? [17:29] (which was at https://pastebin.com/ANJQcSut ) [17:29] arkiver ? 
[17:41] *** VerifiedJ has joined #archiveteam-bs [18:21] *** ndiddy has joined #archiveteam-bs [18:29] *** Martle_ has joined #archiveteam-bs [18:29] *** Martle_ has quit IRC (Remote host closed the connection) [18:30] *** Martle_ has joined #archiveteam-bs [18:35] *** Martle has quit IRC (Read error: Operation timed out) [18:46] mhh i am now looking at which websites have the deprecated header, otherwise you might have a problem with 200000 random sites which are not under threat at all [18:52] *** Mateon1 has quit IRC (Ping timeout: 265 seconds) [18:53] *** Mateon1 has joined #archiveteam-bs [19:24] VoynichCr: This is your bot, right? ↑ [19:25] *** ndiddy has quit IRC () [19:26] PurpleSym: yes [19:27] well, those were my edits, my bot signs edits with "BOT" in summary [19:27] those were hand-made [19:32] would 1-16 Mbit/s of requests be considered a threat? [19:33] VoynichCr: I added that info to your userpage. [19:34] Shall we create a category for pages your bot updates? [19:44] twoTBHetz: bandwidth is not a useful metric unless it is purely static files [19:47] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.) [19:49] *** dashcloud has joined #archiveteam-bs [19:49] *** Ryz has joined #archiveteam-bs [20:04] PurpleSym: i created this https://www.archiveteam.org/index.php?title=Category:ArchiveBot [20:08] This does not include the lists Death/Disestablishments in X though. I just thought it would be nice if there was a way to list all auto-updated pages. 
[20:15] *** VADemon_ has joined #archiveteam-bs [20:19] *** VADemon has quit IRC (Ping timeout: 255 seconds) [20:25] twoTBHetz: depending on the size of the lists, I would perhaps be inclined to get all the urls regardless of whether or not they have the 'closing' banner [20:26] I know that some hostinger sites (myself included, before I moved off them) added bits of JS to prevent hostinger injecting extra bits into the site [20:27] so the fact that there is not a banner does not mean the site is going to stick around [20:27] I also don't trust hostinger, and getting a complete grab while we have the chance seems a good idea [20:27] mhh there are 15 uninteresting urls for each interesting one [20:28] the banner is in the html and my client executes no js so the banner will be there [20:28] ah, OK. [20:29] but could you keep a list of all the non-banner sites as well, as maybe what could happen is we could go after the at-risk ones first, then do the less-at-risk-but-still-Hostinger ones after [20:31] I am now at 58219 of 226717 from a reasonable selection from the huge json i got [20:31] i kept the 200MB Json with 1372060 lines and lots of duplication [20:32] and irrelevant hostinger-clone urls like javahostinger [20:32] what do you mean by "reasonable selection"? [20:32] *** Sk1d has quit IRC (Read error: Operation timed out) [20:33] everything that has a cpanel will not be a free hosting [20:34] ah, so how did you work out which ones were free hosting? [20:34] or am I misunderstanding things? [20:36] i saw that a reasonable portion also had cpanel.dom.tld domains and i removed cpanel.dom.tld and dom.tld since it is not going to be a free hosting [20:37] *** Sk1d has joined #archiveteam-bs [20:38] oh, I get it - thanks for the clarification [20:38] my original dataset this time was dns and there were lots of dns level redirections to other hosters. 
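The cpanel filter described above might be sketched like this. The actual tooling used on the Hostinger list is not shown in the log, so the function name and the lowercase normalization are assumptions; the only rule taken from the chat is that whenever `cpanel.dom.tld` appears in the list, both it and `dom.tld` are dropped as unlikely to be free hosting.

```python
def drop_cpanel_sites(hostnames):
    """Remove every cpanel.<domain> entry together with its <domain>."""
    prefix = "cpanel."
    # Normalize so " Foo.example" and "foo.example" count as the same host.
    hosts = {h.strip().lower() for h in hostnames if h.strip()}
    # Domains that have a cpanel.<domain> sibling somewhere in the list.
    with_cpanel = {h[len(prefix):] for h in hosts if h.startswith(prefix)}
    return sorted(
        h for h in hosts
        if h not in with_cpanel and not h.startswith(prefix)
    )
```

Other subdomains of a filtered domain (for instance a hypothetical `www.dom.tld`) are left alone here, since the chat only mentions removing the two exact names.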
[20:39] i removed redirections to yahoo, google, stuff with "hosting-" in it but i still kept the original dns collection [20:41] or "ovh", "neko", "systems", "yaya", "natro" ... as substring [20:43] i am at 67258 [20:48] meaning 5 more hours? we will see [20:48] great - it's a pain we don't have an exact deadline as to when they're closing [20:49] you already got a list from me :) [20:49] this is with a second dataset [20:53] next time i'll do more than 100 parallel connections [21:23] *** schbirid has quit IRC (Remote host closed the connection) [21:24] *** Sk1d has quit IRC (Read error: Operation timed out) [21:27] *** Sk1d has joined #archiveteam-bs [21:32] can someone throw https://forum.gamestm.co.uk/index.php into archivebot [21:32] or tell me how to do it so it works, considering it's a phpbb forum or something. [21:39] *** Sk1d has quit IRC (Read error: Operation timed out) [21:42] *** BlueMax has joined #archiveteam-bs [21:43] *** Sk1d has joined #archiveteam-bs [22:45] something weird has happened, progress stopped dead in its tracks [22:49] no new site in 7 minutes but i can still curl them from the same machine [22:49] *** Sk1d has quit IRC (Read error: Operation timed out) [22:52] *** Sk1d has joined #archiveteam-bs [23:04] *** Sk1d has quit IRC (Read error: Operation timed out) [23:09] *** Sk1d has joined #archiveteam-bs [23:10] that's so weird, i can access all entries up to a certain alphabetical entry. 
and after that i can not reach them from the server [23:12] *** Mateon1 has quit IRC (Remote host closed the connection) [23:12] *** Mateon1 has joined #archiveteam-bs [23:21] *** Sk1d has quit IRC (Read error: Operation timed out) [23:21] * twoTBHetz is highly confused and slightly panicked [23:25] *** Sk1d has joined #archiveteam-bs [23:39] *** Sk1d has quit IRC (Read error: Operation timed out) [23:42] *** Sk1d has joined #archiveteam-bs [23:54] *** Sk1d has quit IRC (Read error: Operation timed out) [23:54] *** atomicthu has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.) [23:54] *** atomicthu has joined #archiveteam-bs [23:58] *** Sk1d has joined #archiveteam-bs