#archiveteam-bs 2018-05-04,Fri

Time	Nickname	Message
00:08 ^🔗		Gfy has quit IRC (Read error: Operation timed out)
00:22 ^🔗		Gfy has joined #archiveteam-bs
01:51 ^🔗		Mateon1 has quit IRC (Read error: Operation timed out)
01:52 ^🔗		Mateon1 has joined #archiveteam-bs
02:03 ^🔗		m007a83_ has joined #archiveteam-bs
02:07 ^🔗		m007a83__ has quit IRC (Read error: Operation timed out)
02:07 ^🔗		m007a83 has joined #archiveteam-bs
02:11 ^🔗		m007a83_ has quit IRC (Read error: Operation timed out)
02:42 ^🔗		m007a83_ has joined #archiveteam-bs
02:46 ^🔗		m007a83 has quit IRC (Read error: Operation timed out)
02:54 ^🔗		dashcloud has joined #archiveteam-bs
02:59 ^🔗	godane	so i'm at 7k items so far this month
03:39 ^🔗		BlueMax has joined #archiveteam-bs
04:06 ^🔗		odemg has quit IRC (Read error: Operation timed out)
04:11 ^🔗		odemg has joined #archiveteam-bs
04:31 ^🔗		m007a83_ is now known as m007a83
05:18 ^🔗		ReimuHaku has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
05:26 ^🔗		sun_shine has joined #archiveteam-bs
05:28 ^🔗	sun_shine	hi. I'm curious if anyone around has used public hosts files or ABP style filters in web crawling to avoid fetching a billion redundant trackers since this has been an issue in my own private crawls
05:31 ^🔗	sun_shine	I ended up making a plugin for grab-site that does something along these lines instead of making thousands of regex filters to check every link against. wondering if anyone out there has done the same because I'm not great at scripting
05:34 ^🔗		Selavi has quit IRC (Quit: verb. to stop or discontinue)
05:42 ^🔗		Selavi has joined #archiveteam-bs
06:06 ^🔗		namespace has quit IRC (Quit: Quit)
07:48 ^🔗		Mayonaise has quit IRC (Read error: Operation timed out)
07:48 ^🔗		schbirid has joined #archiveteam-bs
07:55 ^🔗	JAA	sun_shine: We're filtering out a few trackers in ArchiveBot in the global ignore set (which should also be in grab-site). Everything else is grabbed by default. Doesn't cause many issues in my experience.
07:56 ^🔗	JAA	Note that I distinguish between trackers and ads. Trackers are mostly useless in the context of web archival, but in my opinion, we should generally grab ads since they're part of what the user sees when visiting a web page.
07:56 ^🔗	sun_shine	The size difference is minimal but the number of connections was lower
07:57 ^🔗		Mayonaise has joined #archiveteam-bs
07:57 ^🔗	JAA	I was more thinking in terms of number of URLs grabbed.
07:57 ^🔗	sun_shine	I looked at python libraries for parsing adblock plus filters and decided my criteria were similar.. i'm more concerned about nuisance hosts than nuisance content
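
[Editor's note] The host-based filtering sun_shine describes could look roughly like the sketch below. It is not sun_shine's plugin and not part of grab-site or ArchiveBot. The function names `parse_hosts_file` and `is_blocked` are hypothetical, and treating subdomains of a listed host as blocked is an assumption borrowed from ABP-style domain rules (plain hosts files only match exact names). A set lookup per URL replaces checking each link against thousands of regexes.

```python
from urllib.parse import urlsplit

# Names that hosts files commonly map to loopback; not nuisance hosts.
LOCAL_NAMES = {"localhost", "localhost.localdomain", "broadcasthost"}


def parse_hosts_file(text):
    """Return the set of hostnames listed in hosts-file formatted text."""
    blocked = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        # Entries look like "0.0.0.0 tracker.example"; skip the address field.
        for host in fields[1:]:
            host = host.lower().rstrip(".")
            if host and host not in LOCAL_NAMES:
                blocked.add(host)
    return blocked


def is_blocked(url, blocked):
    """True if the URL's host, or any parent domain of it, is in the set."""
    host = urlsplit(url).hostname
    if host is None:  # e.g. mailto: links have no host
        return False
    parts = host.lower().rstrip(".").split(".")
    # Assumption: a listed host also blocks its subdomains.
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))
```

A crawler hook would call `is_blocked(url, blocked)` once per candidate URL and skip the fetch when it returns `True`. This only addresses nuisance hosts, not nuisance content on otherwise wanted hosts.
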
08:58 ^🔗		sun_shine has quit IRC (Quit: Leaving)
09:48 ^🔗		wp494_ has joined #archiveteam-bs
09:52 ^🔗		wp494 has quit IRC (Ping timeout: 252 seconds)
10:00 ^🔗		BlueMax has quit IRC (Leaving)
11:29 ^🔗	ebel	I think grab-site has some sort of igset? ignore sets. And the default does include some ad/tracker stuff? I think
11:31 ^🔗	ebel	ION I am happy that twitter's mobile site can work entirely without JS, which means I can archive things without needing phantomjs, making my crawls quicker, and less likely to have oodles of JS & tracker things.
11:34 ^🔗	JAA	Even the desktop site can be archived reasonably well without PhantomJS. Browsing the archives still requires JS though, I believe.
11:42 ^🔗	ebel	yeah, you get something that you can read (FSVO read).
11:42 ^🔗	ebel	My goal is for something that you could take a screenshot of years later and it looks "normal".
11:43 ^🔗	ebel	Plain ol' wpull is working well for me.
11:44 ^🔗	JAA	That should probably work with the desktop site (unless browsers suddenly break support for HTML or JS features, in which case you'd have to dig out an old browser version).
11:45 ^🔗	JAA	But yeah, the mobile site is probably better for it. Unfortunately, it's not what's linked in most parts of the web, and many people probably don't even know about it. So for playback in the Wayback Machine, we still need to grab the desktop site.
11:47 ^🔗	ebel	Sure. My goal isn't to make something that fixes linkrot for the web, but to archive things that were said on twitter. I'm OK with someone (me?) having to manually dig things out later
11:47 ^🔗	JAA	Alright.
11:48 ^🔗	JAA	Remind me, this is about the upcoming elections in Ireland, right?
11:51 ^🔗	ebel	One referendum. In ~3 weeks. about abortion.
11:52 ^🔗	JAA	Ah
11:52 ^🔗	ebel	There was a same-sex marriage one about 3 years ago. And about a year ago I noticed one of the campaign groups deactivated its Facebook account.
11:53 ^🔗	ebel	Probably afraid that in ~10 years' time a comment from some aspiring politician might be dug up and prove embarrassing. :P
11:53 ^🔗	JAA	Well, public archives would be good for something like this. So if you only intend to grab the mobile Twitter page, please let me know which accounts need to be preserved, and I'll cast a spell.
11:57 ^🔗	ebel	OK. :) I'm building a list. Some of these groups have been around since ~90s, and their websites are still in the internet archive. They haven't figured out the robots.txt thing. So that's good
11:58 ^🔗	JAA	Thanks :-) It might also be a good idea to regrab those old websites since they might have content specific to this referendum etc.
11:59 ^🔗	ebel	I've downloaded all the historic versions from IA ;)
12:00 ^🔗	ebel	I use a FF extension for archiving a page to IA & archive.is, which I use sometimes. I hope that tells IA "hey here's a site you might like"
12:06 ^🔗		HCross has quit IRC (Read error: Operation timed out)
12:07 ^🔗		HCross has joined #archiveteam-bs
12:34 ^🔗		HCross has quit IRC (Read error: Connection reset by peer)
12:39 ^🔗		HCross has joined #archiveteam-bs
12:56 ^🔗		mistym has quit IRC (Quit: ZNC - http://znc.in)
13:05 ^🔗		mistym has joined #archiveteam-bs
13:11 ^🔗		HCross_ has joined #archiveteam-bs
13:16 ^🔗		HCross has quit IRC (Read error: Operation timed out)
13:16 ^🔗		HCross_ is now known as HCross
13:22 ^🔗		lindalap_ has joined #archiveteam-bs
13:22 ^🔗		lindalap has quit IRC (Read error: Connection reset by peer)
13:22 ^🔗		lindalap_ is now known as lindalap
14:28 ^🔗		ReimuHaku has joined #archiveteam-bs
14:52 ^🔗	ebel	Ung. I've a problem with wpull. I think it is not downloading page requisites when you specify --recursive... :(
14:52 ^🔗	ebel	(I have --span-hosts-allow page-requisites)
14:54 ^🔗	JAA	--page-requisites
14:58 ^🔗	ebel	yes I have that too. :)
14:59 ^🔗	ebel	If I change the command line, to only remove "--level 1 --recursive" flags, then I get page reqs (obv just for the one URL that I tell it to download)
15:11 ^🔗		icedice has joined #archiveteam-bs
15:13 ^🔗	ebel	It does dl some page reqs, like images??
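
[Editor's note] The comparison ebel describes (same command with and without "--level 1 --recursive") can be made repeatable by generating both command lines from one place. This is only a sketch: the helper `wpull_args` and the URL are hypothetical, ebel's full command line is not in the log, and only the flags quoted in the conversation are used. It builds the argument lists and prints them; it does not run wpull or diagnose the problem.

```python
import shlex


def wpull_args(url, recursive=True):
    """Build a wpull argument list using only the flags quoted in the log."""
    args = [
        "wpull",
        url,
        "--page-requisites",
        "--span-hosts-allow", "page-requisites",
    ]
    if recursive:
        # The two flags ebel removes to get page requisites back.
        args += ["--recursive", "--level", "1"]
    return args


if __name__ == "__main__":
    url = "https://example.org/"  # placeholder, not from the log
    for recursive in (True, False):
        print(" ".join(shlex.quote(a) for a in wpull_args(url, recursive)))
```

Running each printed command into a separate output directory and comparing the fetched URLs would show which page requisites go missing in the recursive run.
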
15:38 ^🔗		icedice2 has joined #archiveteam-bs
15:42 ^🔗		icedice has quit IRC (Ping timeout: 252 seconds)
15:59 ^🔗		icedice2 has quit IRC (Quit: Leaving)
16:00 ^🔗		godane has quit IRC (Read error: Operation timed out)
16:15 ^🔗		godane has joined #archiveteam-bs
16:15 ^🔗		svchfoo3 sets mode: +o godane
16:31 ^🔗		m007a83 has quit IRC (Quit: Leaving)
16:33 ^🔗		Mateon1 has quit IRC (Ping timeout: 633 seconds)
16:33 ^🔗		Mateon1 has joined #archiveteam-bs
16:45 ^🔗		m007a83 has joined #archiveteam-bs
17:11 ^🔗		jschwart has joined #archiveteam-bs
17:11 ^🔗		slyphic has quit IRC (Quit: leaving)
17:11 ^🔗		slyphic has joined #archiveteam-bs
18:14 ^🔗		m007a83 has quit IRC (Ping timeout: 252 seconds)
18:26 ^🔗		m007a83 has joined #archiveteam-bs
18:30 ^🔗		m007a83 has quit IRC (Client Quit)
19:21 ^🔗		m007a83 has joined #archiveteam-bs
20:35 ^🔗		nyaomi has quit IRC (Read error: Connection reset by peer)
20:39 ^🔗		nyaomi has joined #archiveteam-bs
21:09 ^🔗		wp494_ is now known as wp494
21:20 ^🔗		wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
21:20 ^🔗		wp494 has joined #archiveteam-bs
22:18 ^🔗		lindalap has quit IRC (Read error: Connection reset by peer)
22:19 ^🔗		lindalap has joined #archiveteam-bs
22:21 ^🔗		schbirid has quit IRC (Quit: Leaving)
22:23 ^🔗		BlueMax has joined #archiveteam-bs
22:53 ^🔗		RichardG_ has joined #archiveteam-bs
22:53 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
22:58 ^🔗		jschwart has quit IRC (Quit: Konversation terminated!)
23:08 ^🔗		lindalap_ has joined #archiveteam-bs
23:08 ^🔗		lindalap has quit IRC (Read error: Connection reset by peer)
23:08 ^🔗		lindalap_ is now known as lindalap
23:44 ^🔗		godane has quit IRC (Ping timeout: 260 seconds)
23:50 ^🔗		Mateon1 has quit IRC (Ping timeout: 252 seconds)
23:50 ^🔗		Mateon1 has joined #archiveteam-bs
