[00:08] *** Gfy has quit IRC (Read error: Operation timed out)
[00:22] *** Gfy has joined #archiveteam-bs
[01:51] *** Mateon1 has quit IRC (Read error: Operation timed out)
[01:52] *** Mateon1 has joined #archiveteam-bs
[02:03] *** m007a83_ has joined #archiveteam-bs
[02:07] *** m007a83__ has quit IRC (Read error: Operation timed out)
[02:07] *** m007a83 has joined #archiveteam-bs
[02:11] *** m007a83_ has quit IRC (Read error: Operation timed out)
[02:42] *** m007a83_ has joined #archiveteam-bs
[02:46] *** m007a83 has quit IRC (Read error: Operation timed out)
[02:54] *** dashcloud has joined #archiveteam-bs
[02:59] so i'm at 7k items so far this month
[03:39] *** BlueMax has joined #archiveteam-bs
[04:06] *** odemg has quit IRC (Read error: Operation timed out)
[04:11] *** odemg has joined #archiveteam-bs
[04:31] *** m007a83_ is now known as m007a83
[05:18] *** ReimuHaku has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[05:26] *** sun_shine has joined #archiveteam-bs
[05:28] hi. I'm curious if anyone around has used public hosts files or ABP style filters in web crawling to avoid fetching a billion redundant trackers since this has been an issue in my own private crawls
[05:31] I ended up making a plugin for grab-site that does something along these lines instead of making thousands of regex filters to check every link against. wondering if anyone out there has done the same because I'm not great at scripting
[05:34] *** Selavi has quit IRC (Quit: verb. to stop or discontinue)
[05:42] *** Selavi has joined #archiveteam-bs
[06:06] *** namespace has quit IRC (Quit: Quit)
[07:48] *** Mayonaise has quit IRC (Read error: Operation timed out)
[07:48] *** schbirid has joined #archiveteam-bs
[07:55] sun_shine: We're filtering out a few trackers in ArchiveBot in the global ignore set (which should also be in grab-site). Everything else is grabbed by default. Doesn't cause many issues in my experience.
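Editor's note: the hosts-file idea raised at 05:28 can be sketched roughly as below. This is not the grab-site plugin mentioned at 05:31 (its code is not shown in the log), and it does not use any grab-site or wpull API. The names `parse_hosts` and `is_blocked` are hypothetical, and matching parent domains as well as exact hostnames is an assumption borrowed from how ABP-style domain filters behave, not from hosts-file semantics. The point is that one set lookup per hostname label replaces thousands of regex checks per link.

```python
from urllib.parse import urlsplit

# Names that commonly appear in hosts files but are not block entries.
LOCAL_NAMES = {"localhost", "localhost.localdomain", "broadcasthost"}


def parse_hosts(text):
    """Collect hostnames from hosts-file text ("<ip> <host> [<host> ...]")."""
    blocked = set()
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # not an "<ip> <host>" entry
        for host in fields[1:]:
            host = host.lower().rstrip(".")
            if host not in LOCAL_NAMES:
                blocked.add(host)
    return blocked


def is_blocked(url, blocked):
    """Return True if the URL's host, or any parent domain, is in the set.

    Parent-domain matching is an assumption of this sketch; a strict
    hosts-file reading would only match the exact hostname.
    """
    host = urlsplit(url).hostname
    if host is None:
        return False
    labels = host.lower().rstrip(".").split(".")
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))
```

A crawler hook would call `is_blocked(url, blocked)` once per candidate URL and skip the fetch when it returns True.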
[07:56] Note that I distinguish between trackers and ads. Trackers are mostly useless in the context of web archival, but in my opinion, we should generally grab ads since they're part of what the user sees when visiting a web page.
[07:56] The size difference is minimal but the number of connections was lower
[07:57] *** Mayonaise has joined #archiveteam-bs
[07:57] I was more thinking in terms of number of URLs grabbed.
[07:57] I looked at python libraries for parsing adblock plus filters and decided my criteria were similar.. i'm more concerned about nuisance hosts than nuisance content
[08:58] *** sun_shine has quit IRC (Quit: Leaving)
[09:48] *** wp494_ has joined #archiveteam-bs
[09:52] *** wp494 has quit IRC (Ping timeout: 252 seconds)
[10:00] *** BlueMax has quit IRC (Leaving)
[11:29] I think grab-site has some sort of igset? ignore sets. And the default does include some ad/tracker stuff? I think
[11:31] ION I am happy that twitter's mobile site can work entirely without JS, which means I can archive things without needing phantomjs, making my crawls quicker, and less likely to have oodles of JS & tracker things.
[11:34] Even the desktop site can be archived reasonably well without PhantomJS. Browsing the archives still requires JS though, I believe.
[11:42] yeah, you get something that you can read (FSVO read).
[11:42] My goal is for something that you could take a screenshot of years later and it looks "normal".
[11:43] Plain ol' wpull is working well for me.
[11:44] That should probably work with the desktop site (unless browsers suddenly break support for HTML or JS features, in which case you'd have to dig out an old browser version).
[11:45] But yeah, the mobile site is probably better for it. Unfortunately, it's not what's linked in most parts of the web, and many people probably don't even know about it. So for playback in the Wayback Machine, we still need to grab the desktop site.
[11:47] Sure. My goal isn't to make something that fixes linkrot for the web, but to archive things that were said on twitter. I'm OK with someone (me?) having to manually dig things out later
[11:47] Alright.
[11:48] Remind me, this is about the upcoming elections in Ireland, right?
[11:51] One referendum. In ~3 weeks. about abortion.
[11:52] Ah
[11:52] There was a same-sex marriage one about 3 years ago. And about a year ago I noticed one of the campaign groups deactivated the facebook account.
[11:53] Probably afraid that in ~10 years time the comment from some aspiring politician might be dug up and embarrassed. :P
[11:53] Well, public archives would be good for something like this. So if you only intend to grab the mobile Twitter page, please let me know which accounts need to be preserved, and I'll cast a spell.
[11:57] OK. :) I'm building a list. Some of these groups have been around since ~90s, and their websites are still in the internet archive. They haven't figured out the robots.txt thing. So that's good
[11:58] Thanks :-) It might also be a good idea to regrab those old websites since they might have content specific to this referendum etc.
[11:59] I've downloaded all the historic versions from IA ;)
[12:00] I use a FF extension for archiving a page to IA & archive.is, which I use sometimes. I hope that tells IA "hey here's a site you might like"
[12:06] *** HCross has quit IRC (Read error: Operation timed out)
[12:07] *** HCross has joined #archiveteam-bs
[12:34] *** HCross has quit IRC (Read error: Connection reset by peer)
[12:39] *** HCross has joined #archiveteam-bs
[12:56] *** mistym has quit IRC (Quit: ZNC - http://znc.in)
[13:05] *** mistym has joined #archiveteam-bs
[13:11] *** HCross_ has joined #archiveteam-bs
[13:16] *** HCross has quit IRC (Read error: Operation timed out)
[13:16] *** HCross_ is now known as HCross
[13:22] *** lindalap_ has joined #archiveteam-bs
[13:22] *** lindalap has quit IRC (Read error: Connection reset by peer)
[13:22] *** lindalap_ is now known as lindalap
[14:28] *** ReimuHaku has joined #archiveteam-bs
[14:52] Ung. I've a problem with wpull. I think it is not downloading page requisites when you specify --recursive... :(
[14:52] (I have --span-hosts-allow page-requisites)
[14:54] --page-requisites
[14:58] yes I have that too. :)
[14:59] If I change the command line, to only remove "--level 1 --recursive" flags, then I get page reqs (obv just for the one URL that I tell it to download)
[15:11] *** icedice has joined #archiveteam-bs
[15:13] It does dl some page reqs, like images??
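Editor's note: a rough sketch of the two wpull invocations being compared between 14:52 and 14:59, for anyone who wants to reproduce the difference. It only assembles and prints the command lines and does not run wpull. The flags are the ones quoted in the chat. The helper name `build_wpull_args` and the URL are placeholders, and the actual command line used in the channel is not in the log, so other options may have been present.

```python
import shlex


def build_wpull_args(url, recursive):
    """Build a wpull argument list using only the flags quoted in the chat."""
    args = [
        "wpull",
        url,
        "--page-requisites",
        "--span-hosts-allow", "page-requisites",
    ]
    if recursive:
        # The variant reported as skipping some page requisites.
        args += ["--recursive", "--level", "1"]
    return args


if __name__ == "__main__":
    # Placeholder URL; substitute the page being archived.
    for recursive in (False, True):
        print(shlex.join(build_wpull_args("https://example.com/", recursive)))
```

Running both printed commands against the same page and comparing the fetched URLs would show which requisites go missing in the recursive case.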
[15:38] *** icedice2 has joined #archiveteam-bs
[15:42] *** icedice has quit IRC (Ping timeout: 252 seconds)
[15:59] *** icedice2 has quit IRC (Quit: Leaving)
[16:00] *** godane has quit IRC (Read error: Operation timed out)
[16:15] *** godane has joined #archiveteam-bs
[16:15] *** svchfoo3 sets mode: +o godane
[16:31] *** m007a83 has quit IRC (Quit: Leaving)
[16:33] *** Mateon1 has quit IRC (Ping timeout: 633 seconds)
[16:33] *** Mateon1 has joined #archiveteam-bs
[16:45] *** m007a83 has joined #archiveteam-bs
[17:11] *** jschwart has joined #archiveteam-bs
[17:11] *** slyphic has quit IRC (Quit: leaving)
[17:11] *** slyphic has joined #archiveteam-bs
[18:14] *** m007a83 has quit IRC (Ping timeout: 252 seconds)
[18:26] *** m007a83 has joined #archiveteam-bs
[18:30] *** m007a83 has quit IRC (Client Quit)
[19:21] *** m007a83 has joined #archiveteam-bs
[20:35] *** nyaomi has quit IRC (Read error: Connection reset by peer)
[20:39] *** nyaomi has joined #archiveteam-bs
[21:09] *** wp494_ is now known as wp494
[21:20] *** wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
[21:20] *** wp494 has joined #archiveteam-bs
[22:18] *** lindalap has quit IRC (Read error: Connection reset by peer)
[22:19] *** lindalap has joined #archiveteam-bs
[22:21] *** schbirid has quit IRC (Quit: Leaving)
[22:23] *** BlueMax has joined #archiveteam-bs
[22:53] *** RichardG_ has joined #archiveteam-bs
[22:53] *** RichardG has quit IRC (Read error: Connection reset by peer)
[22:58] *** jschwart has quit IRC (Quit: Konversation terminated!)
[23:08] *** lindalap_ has joined #archiveteam-bs
[23:08] *** lindalap has quit IRC (Read error: Connection reset by peer)
[23:08] *** lindalap_ is now known as lindalap
[23:44] *** godane has quit IRC (Ping timeout: 260 seconds)
[23:50] *** Mateon1 has quit IRC (Ping timeout: 252 seconds)
[23:50] *** Mateon1 has joined #archiveteam-bs