#archiveteam-bs 2018-05-04,Fri


Time Nickname Message
00:08 🔗 Gfy has quit IRC (Read error: Operation timed out)
00:22 🔗 Gfy has joined #archiveteam-bs
01:51 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
01:52 🔗 Mateon1 has joined #archiveteam-bs
02:03 🔗 m007a83_ has joined #archiveteam-bs
02:07 🔗 m007a83__ has quit IRC (Read error: Operation timed out)
02:07 🔗 m007a83 has joined #archiveteam-bs
02:11 🔗 m007a83_ has quit IRC (Read error: Operation timed out)
02:42 🔗 m007a83_ has joined #archiveteam-bs
02:46 🔗 m007a83 has quit IRC (Read error: Operation timed out)
02:54 🔗 dashcloud has joined #archiveteam-bs
02:59 🔗 godane so i'm at 7k items so far this month
03:39 🔗 BlueMax has joined #archiveteam-bs
04:06 🔗 odemg has quit IRC (Read error: Operation timed out)
04:11 🔗 odemg has joined #archiveteam-bs
04:31 🔗 m007a83_ is now known as m007a83
05:18 🔗 ReimuHaku has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
05:26 🔗 sun_shine has joined #archiveteam-bs
05:28 🔗 sun_shine hi. I'm curious if anyone around has used public hosts files or ABP style filters in web crawling to avoid fetching a billion redundant trackers since this has been an issue in my own private crawls
05:31 🔗 sun_shine I ended up making a plugin for grab-site that does something along these lines instead of making thousands of regex filters to check every link against. wondering if anyone out there has done the same because I'm not great at scripting
05:34 🔗 Selavi has quit IRC (Quit: verb. to stop or discontinue)
05:42 🔗 Selavi has joined #archiveteam-bs
06:06 🔗 namespace has quit IRC (Quit: Quit)
07:48 🔗 Mayonaise has quit IRC (Read error: Operation timed out)
07:48 🔗 schbirid has joined #archiveteam-bs
07:55 🔗 JAA sun_shine: We're filtering out a few trackers in ArchiveBot in the global ignore set (which should also be in grab-site). Everything else is grabbed by default. Doesn't cause many issues in my experience.
07:56 🔗 JAA Note that I distinguish between trackers and ads. Trackers are mostly useless in the context of web archival, but in my opinion, we should generally grab ads since they're part of what the user sees when visiting a web page.
07:56 🔗 sun_shine The size difference is minimal but the number of connections was lower
07:57 🔗 Mayonaise has joined #archiveteam-bs
07:57 🔗 JAA I was more thinking in terms of number of URLs grabbed.
07:57 🔗 sun_shine I looked at python libraries for parsing adblock plus filters and decided my criteria were similar.. i'm more concerned about nuisance hosts than nuisance content
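A minimal sketch of the hosts-file approach sun_shine describes: load a public hosts file into a set once, then test each candidate URL's hostname against it instead of running thousands of regex filters per link. This is not sun_shine's plugin and does not use the grab-site or wpull plugin APIs; the function names `parse_hosts` and `is_blocked` and the sample hostnames are hypothetical. Matching parent domains as well as exact hosts is an assumption made here, since a hosts file on its own only blocks exact names.

```python
from urllib.parse import urlsplit


def parse_hosts(text):
    """Collect hostnames from hosts-file text (lines like '0.0.0.0 tracker.example')."""
    blocked = set()
    for line in text.splitlines():
        # Drop comments (whole-line or trailing) and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # The first field is the IP address; every later field is a hostname.
        for host in fields[1:]:
            blocked.add(host.lower())
    # Hosts files usually map localhost too; that is not a nuisance host.
    blocked.discard("localhost")
    return blocked


def is_blocked(url, blocked):
    """Return True if the URL's host, or any parent domain of it, is in the set."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Try 'a.b.c', then 'b.c', then 'c': one set lookup per candidate.
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))


if __name__ == "__main__":
    sample = "0.0.0.0 tracker.example\n0.0.0.0 ads.example stats.example\n"
    blocked = parse_hosts(sample)
    for url in ("https://cdn.tracker.example/pixel.gif", "https://example.org/"):
        print(url, "skip" if is_blocked(url, blocked) else "fetch")
```

Each check costs a handful of set lookups regardless of how many hosts are listed, which is the main reason to prefer it over a long list of regexes. It filters by host only, which fits the "nuisance hosts rather than nuisance content" goal above; ABP-style path and element rules would need a real filter parser.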
08:58 🔗 sun_shine has quit IRC (Quit: Leaving)
09:48 🔗 wp494_ has joined #archiveteam-bs
09:52 🔗 wp494 has quit IRC (Ping timeout: 252 seconds)
10:00 🔗 BlueMax has quit IRC (Leaving)
11:29 🔗 ebel I think grab-site has some sort of igset? ignore sets. And the default does include some ad/tracker stuff? I think
11:31 🔗 ebel ION I am happy that twitter's mobile site can work entirely without JS, which means I can archive things without needing phantomjs, making my crawls quicker, and less likely to have oodles of JS & tracker things.
11:34 🔗 JAA Even the desktop site can be archived reasonably well without PhantomJS. Browsing the archives still requires JS though, I believe.
11:42 🔗 ebel yeah, you get something that you can read (FSVO read).
11:42 🔗 ebel My goal is for something that you could take a screenshot of years later and it looks "normal".
11:43 🔗 ebel Plain ol' wpull is working well for me.
11:44 🔗 JAA That should probably work with the desktop site (unless browsers suddenly break support for HTML or JS features, in which case you'd have to dig out an old browser version).
11:45 🔗 JAA But yeah, the mobile site is probably better for it. Unfortunately, it's not what's linked in most parts of the web, and many people probably don't even know about it. So for playback in the Wayback Machine, we still need to grab the desktop site.
11:47 🔗 ebel Sure. My goal isn't to make something that fixes linkrot for the web, but to archive things that were said on twitter. I'm OK with someone (me?) having to manually dig things out later
11:47 🔗 JAA Alright.
11:48 🔗 JAA Remind me, this is about the upcoming elections in Ireland, right?
11:51 🔗 ebel One referendum. In ~3 weeks. about abortion.
11:52 🔗 JAA Ah
11:52 🔗 ebel There was a same-sex marriage one about 3 years ago. And about a year ago I noticed one of the campaign groups deactivated the facebook account.
11:53 🔗 ebel Probably afraid that in ~10 years time the comment from some aspiring politician might be dug up and embarrassing. :P
11:53 🔗 JAA Well, public archives would be good for something like this. So if you only intend to grab the mobile Twitter page, please let me know which accounts need to be preserved, and I'll cast a spell.
11:57 🔗 ebel OK. :) I'm building a list. Some of these groups have been around since ~90s, and their websites are still in the internet archive. They haven't figured out the robots.txt thing. So that's good
11:58 🔗 JAA Thanks :-) It might also be a good idea to regrab those old websites since they might have content specific to this referendum etc.
11:59 🔗 ebel I've downloaded all the historic versions from IA ;)
12:00 🔗 ebel I use a FF extension for archiving a page to IA & archive.is, which I use sometimes. I hope that tells IA "hey here's a site you might like"
12:06 🔗 HCross has quit IRC (Read error: Operation timed out)
12:07 🔗 HCross has joined #archiveteam-bs
12:34 🔗 HCross has quit IRC (Read error: Connection reset by peer)
12:39 🔗 HCross has joined #archiveteam-bs
12:56 🔗 mistym has quit IRC (Quit: ZNC - http://znc.in)
13:05 🔗 mistym has joined #archiveteam-bs
13:11 🔗 HCross_ has joined #archiveteam-bs
13:16 🔗 HCross has quit IRC (Read error: Operation timed out)
13:16 🔗 HCross_ is now known as HCross
13:22 🔗 lindalap_ has joined #archiveteam-bs
13:22 🔗 lindalap has quit IRC (Read error: Connection reset by peer)
13:22 🔗 lindalap_ is now known as lindalap
14:28 🔗 ReimuHaku has joined #archiveteam-bs
14:52 🔗 ebel Ung. I've a problem with wpull. I think it is not downloading page requisites when you specify --recursive... :(
14:52 🔗 ebel (I have --span-hosts-allow page-requisites)
14:54 🔗 JAA --page-requisites
14:58 🔗 ebel yes I have that too. :)
14:59 🔗 ebel If I change the command line, to only remove "--level 1 --recursive" flags, then I get page reqs (obv just for the one URL that I tell it to download)
15:11 🔗 icedice has joined #archiveteam-bs
15:13 🔗 ebel It does dl some page reqs, like images??
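For reference, the flag combination under discussion can be written out as an argument list. This is only a sketch of the invocation as described in the messages above (`--recursive`, `--level 1`, `--page-requisites`, `--span-hosts-allow page-requisites`); it does not diagnose or fix the missing page requisites. The helper name `build_wpull_args`, the example URL, and the WARC file name are hypothetical, and adding `--warc-file` is an assumption about how the crawl is being saved.

```python
import shlex


def build_wpull_args(url, warc_name, level=1):
    """Assemble a wpull command line for a shallow recursive crawl with page requisites."""
    return [
        "wpull", url,
        "--recursive", "--level", str(level),    # follow links, one level deep
        "--page-requisites",                     # also fetch images, CSS, scripts
        "--span-hosts-allow", "page-requisites", # requisites may live on other hosts
        "--warc-file", warc_name,                # assumed: write output to a WARC
    ]


if __name__ == "__main__":
    # Hypothetical account URL and output name, for illustration only.
    args = build_wpull_args("https://mobile.twitter.com/example_account", "example_account")
    print(shlex.join(args))
```

Building the list in one place makes it easy to compare runs with and without `--recursive` and `--level`, which is the experiment described above.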
15:38 🔗 icedice2 has joined #archiveteam-bs
15:42 🔗 icedice has quit IRC (Ping timeout: 252 seconds)
15:59 🔗 icedice2 has quit IRC (Quit: Leaving)
16:00 🔗 godane has quit IRC (Read error: Operation timed out)
16:15 🔗 godane has joined #archiveteam-bs
16:15 🔗 svchfoo3 sets mode: +o godane
16:31 🔗 m007a83 has quit IRC (Quit: Leaving)
16:33 🔗 Mateon1 has quit IRC (Ping timeout: 633 seconds)
16:33 🔗 Mateon1 has joined #archiveteam-bs
16:45 🔗 m007a83 has joined #archiveteam-bs
17:11 🔗 jschwart has joined #archiveteam-bs
17:11 🔗 slyphic has quit IRC (Quit: leaving)
17:11 🔗 slyphic has joined #archiveteam-bs
18:14 🔗 m007a83 has quit IRC (Ping timeout: 252 seconds)
18:26 🔗 m007a83 has joined #archiveteam-bs
18:30 🔗 m007a83 has quit IRC (Client Quit)
19:21 🔗 m007a83 has joined #archiveteam-bs
20:35 🔗 nyaomi has quit IRC (Read error: Connection reset by peer)
20:39 🔗 nyaomi has joined #archiveteam-bs
21:09 🔗 wp494_ is now known as wp494
21:20 🔗 wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
21:20 🔗 wp494 has joined #archiveteam-bs
22:18 🔗 lindalap has quit IRC (Read error: Connection reset by peer)
22:19 🔗 lindalap has joined #archiveteam-bs
22:21 🔗 schbirid has quit IRC (Quit: Leaving)
22:23 🔗 BlueMax has joined #archiveteam-bs
22:53 🔗 RichardG_ has joined #archiveteam-bs
22:53 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
22:58 🔗 jschwart has quit IRC (Quit: Konversation terminated!)
23:08 🔗 lindalap_ has joined #archiveteam-bs
23:08 🔗 lindalap has quit IRC (Read error: Connection reset by peer)
23:08 🔗 lindalap_ is now known as lindalap
23:44 🔗 godane has quit IRC (Ping timeout: 260 seconds)
23:50 🔗 Mateon1 has quit IRC (Ping timeout: 252 seconds)
23:50 🔗 Mateon1 has joined #archiveteam-bs
