Time | Nickname | Message
00:13 | | Stil3tt0 is now known as Stiletto
00:14 | | GE has quit IRC (Remote host closed the connection)
01:00 | | __sagitai has joined #archiveteam-bs
01:05 | | _sagitair has quit IRC (Ping timeout: 370 seconds)
01:15 | | db420 is now known as dboard
01:16 | | icedice2 has joined #archiveteam-bs
01:17 | | icedice has quit IRC (Read error: Connection reset by peer)
01:20 | | rocode_ has joined #archiveteam-bs
01:27 | | rocode has quit IRC (Ping timeout: 246 seconds)
01:27 | | rocode_ is now known as rocode
01:30 | | kristian_ has quit IRC (Quit: Leaving)
02:29 | | ndiddy has quit IRC (Read error: Connection reset by peer)
02:41 | | _sagitair has joined #archiveteam-bs
02:47 | | __sagitai has quit IRC (Ping timeout: 370 seconds)
02:55 | | SchroSct has joined #archiveteam-bs
02:55 | SchroSct | I made it!
02:58 | | schbirid2 has joined #archiveteam-bs
03:03 | | schbirid has quit IRC (Read error: Operation timed out)
03:29 | | odemg has joined #archiveteam-bs
03:37 | | pizzaiolo has left
04:43 | | NONSS has joined #archiveteam-bs
04:48 | | Nons has quit IRC (Read error: Operation timed out)
05:08 | | VADemon has quit IRC (Quit: left4dead)
05:18 | | icedice2 has quit IRC (Quit: Leaving)
05:28 | | Sk1d has quit IRC (Ping timeout: 194 seconds)
05:34 | | Sk1d has joined #archiveteam-bs
05:42 | | User405 has joined #archiveteam-bs
05:43 | | User404 has quit IRC (Read error: Connection reset by peer)
06:22 | | GE has joined #archiveteam-bs
06:33 | | unkn0wn_ has quit IRC ()
07:05 | | Aranje has quit IRC (Quit: Three sheets to the wind)
07:21 | | GE has quit IRC (Remote host closed the connection)
07:32 | | odemg has quit IRC (Remote host closed the connection)
07:33 | | odemg has joined #archiveteam-bs
07:48 | joepie91 | the gory details of why gitlab failed: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
07:48 | joepie91 | (very good write-up)
08:35 | | paparus has joined #archiveteam-bs
08:36 | Sanqui | okay so, i'd say we're interested
08:36 | Sanqui | (though interest is pretty much defined by the number of people willing to help)
08:37 | Sanqui | search is of course an achilles heel
08:37 | paparus | I think the problem here is that specialized enumeration is needed
08:37 | paparus | for each site
08:37 | Sanqui | but: is all the important data available on GET endpoints, like how you linked http://courtindex.sdcourt.ca.gov/CISPublic/casedetail?casenum=SCA153865&casesite=SD&applcode=C
08:37 | Sanqui | ?
08:37 | paparus | no
08:38 | paparus | it depends on the specific site
08:38 | Sanqui | ah
08:38 | | namespace has joined #archiveteam-bs
08:38 | paparus | that's just an example
08:38 | paparus | what would the result be on archive.org?
08:39 | Sanqui | we have archivebot, which allows for websites to be archived and absorbed into the wayback machine
08:39 | paparus | but there is no link structure leading to this specific page
08:39 | Sanqui | so my idea was that the searches could be scraped locally in order to gather the URLs, then those would be put into archivebot
08:40 | Sanqui | so wayback wouldn't have the search but would have the detail pages
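The enumerate-then-ArchiveBot approach Sanqui describes could be sketched as a small script. Only the URL pattern comes from the log; the case-number range and the marker used to detect a real case page are invented for illustration, and any real run would need per-site tuning and politeness checks:

    # Minimal sketch, not a production scraper: walk a hypothetical range of
    # case numbers against the GET endpoint quoted above and keep the URLs
    # that look like real case-detail pages, so the list can later be fed to
    # ArchiveBot. The numeric range and the page marker are assumptions.
    import time
    import urllib.request

    BASE = ("http://courtindex.sdcourt.ca.gov/CISPublic/casedetail"
            "?casenum=SCA{num}&casesite=SD&applcode=C")

    def looks_like_case(html):
        # Hypothetical marker; inspect a known-good page to pick a string
        # that only appears on valid case-detail results.
        return "Case Number" in html

    with open("urls-for-archivebot.txt", "w") as out:
        for num in range(153860, 153870):   # illustrative range only
            url = BASE.format(num=num)
            try:
                page = urllib.request.urlopen(url, timeout=30)
                html = page.read().decode("utf-8", "replace")
            except OSError:
                continue
            if looks_like_case(html):
                out.write(url + "\n")
            time.sleep(1)                   # be gentle with the server

The resulting file is just a list of detail-page URLs; those could then be requested in #archivebot so the pages end up in the Wayback Machine even though no search or link structure points at them.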
08:40 | paparus | is the data in the wayback machine full text searchable even if there is no link structure?
08:40 | Sanqui | i believe there are plans for that
08:40 | Sanqui | and either way, the entire collection could be downloaded
08:41 | paparus | also some results will not even have a unique url, it would be a result of some cgi script
08:41 | paparus | on another site
08:41 | Sanqui | yeah, that's a bigger issue
08:42 | Sanqui | realistically, the best thing you could do right now is to start a wiki article with a list of different sites, url structures, and requirements
08:42 | paparus | ok, let me think this over
08:43 | paparus | would the archive.org have problems with this type of information?
08:43 | paparus | I mean it has some personal names and stuff, but it's all public
08:43 | Sanqui | i (and the majority of archive team) don't speak for archive.org
08:43 | paparus | ok, but in your opinion?
08:44 | paparus | like I've done some research and there were cases where people got in trouble for similar stuff
08:44 | paparus | for instance this is a similar case: https://www.reddit.com/r/Denmark/comments/42w67s/i_am_the_person_who_made_tingbogenstatistikorg/
08:45 | Sanqui | it's government websites, i think currently those are not just accepted but welcomed. personally, i don't have enough of a conscience to know what sort of data is present and what dangers it could pose to people
08:45 | paparus | a guy crawled the danish property registry, and published a site online
08:45 | paparus | with the data
08:45 | paparus | but the danish apparently have a thing called address protection where you register to have your address not showing in the registry for some time
08:46 | paparus | but when he crawled it it was still showing
08:46 | paparus | so it cause a piss storm in denmark and he had to bring the site down
08:46 | Sanqui | i see
08:47 | namespace | Well.
08:47 | Sanqui | yeah, well, i remember news stories saying you can be sued just for accessing a website that wasn't "supposed" to be public, so
08:47 | namespace | In the US, protected information basically isn't a thing AFAIK.
08:47 | namespace | The access thing can be an issue depending on interpretation of the CFAA.
08:47 | namespace | But meh.
08:47 | namespace | ArchiveTeam deals with that all the time.
08:48 | namespace | As does IA. Whether IA wants to host the info just on decency grounds is a different story though.
08:49 | Sanqui | i think you could page somebody from IA with an exact description of what's to be uploaded.
08:50 | paparus | do you have a contact?
08:50 | Sanqui | SketchCow
08:51 | paparus | ok, I'll try
08:56 | joepie91 | paparus: Sanqui: there are plans for full-text search, but archive as if there aren't
08:56 | joepie91 | it's quite likely to take quite some time before it appears
08:56 | joepie91 | I'd imagine that stuff like the Canada backup is higher-priority right now
08:56 | joepie91 | and full-text search on a dataset of this magnitude is *really expensive*
08:56 | joepie91 | (ie. it's likely a question of resources, not of tech)
08:57 | Sanqui | fair
08:58 | Sanqui | search or not, 1. it'd be in wayback, 2. warcs would be up for download; somebody could make their own site with fulltext search if desired
09:07 | paparus | I am reading the comments on the danish guy's website and apparently was faster and better than the gov site it crawled
09:08 | paparus | the gov site only had search by address but he added a full text search including by name
09:08 | paparus | that's government for you
09:14 | | paparus has left
09:15 | | paparus has joined #archiveteam-bs
09:17 | paparus | was archive.org ever sued for violation website TOS?
10:22 | | GE has joined #archiveteam-bs
10:50 | | __sagitai has joined #archiveteam-bs
11:02 | | _sagitair has quit IRC (Read error: Operation timed out)
11:26 | | GE has quit IRC (Remote host closed the connection)
11:50 | | odemg has quit IRC (Remote host closed the connection)
12:04 | | icedice has joined #archiveteam-bs
12:06 | | odemg has joined #archiveteam-bs
12:32 | | BlueMaxim has quit IRC (Read error: Operation timed out)
12:38 | | pizzaiolo has joined #archiveteam-bs
12:41 | SchroSct | archive.org isn't a user, how could they?
12:47 | SchroSct | it should be noted that I was on an intercept path with Nyany until we ran out of work.
13:05 | | GE has joined #archiveteam-bs
13:09 | godane | so i have about 215 more episodes to go with Tech News Today
13:09 | godane | i feel alot better now with that collection
13:13 | | yan has quit IRC (Quit: leaving)
13:39 | | BiggieJon has quit IRC (Quit: Page closed)
13:44 | godane | i'm uploading the nhk world radio japan english news
13:44 | godane | for 2017-01
15:06 | | VADemon has joined #archiveteam-bs
15:10 | | icedice has quit IRC (Quit: Leaving)
15:17 | | SmileyG has quit IRC (Ping timeout: 250 seconds)
15:19 | | VADemon has quit IRC (Quit: left4dead)
15:50 | | Aranje has joined #archiveteam-bs
15:50 | | odemg has quit IRC (Remote host closed the connection)
16:09 | | VADemon has joined #archiveteam-bs
16:09 | | odemg has joined #archiveteam-bs
16:15 | | odemg has quit IRC (Remote host closed the connection)
16:22 | | odemg has joined #archiveteam-bs
16:35 | | odemg has quit IRC (Remote host closed the connection)
16:36 | | odemg has joined #archiveteam-bs
16:44 | | icedice has joined #archiveteam-bs
16:47 | | pizzaiolo has quit IRC (Read error: Connection reset by peer)
16:48 | | pizzaiolo has joined #archiveteam-bs
16:48 | | pizzaiol1 has joined #archiveteam-bs
16:49 | | pizzaiolo has quit IRC (Remote host closed the connection)
16:49 | | pizzaiol1 has quit IRC (Remote host closed the connection)
16:49 | | pizzaiolo has joined #archiveteam-bs
17:00 | | odemg has quit IRC (Remote host closed the connection)
17:05 | | odemg has joined #archiveteam-bs
17:35 | | icedice2 has joined #archiveteam-bs
17:38 | | icedice has quit IRC (Ping timeout: 260 seconds)
17:39 | | ItsYoda has quit IRC (Ping timeout: 260 seconds)
17:44 | | ItsYoda has joined #archiveteam-bs
17:58 | | odemg has quit IRC (Remote host closed the connection)
18:14 | | Smiley has joined #archiveteam-bs
18:31 | | ItsYoda has quit IRC (Ping timeout: 260 seconds)
18:32 | | GE has quit IRC (Remote host closed the connection)
18:38 | | ItsYoda has joined #archiveteam-bs
18:41 | | GE has joined #archiveteam-bs
18:43 | | odemg has joined #archiveteam-bs
19:01 | odemg | 178.62.61.231/ytglitch.mp4
19:05 | Aoede | https://www.youtube.com/watch?v=9E6dWfVwFCI
19:10 | | Muad-Dib has quit IRC (Ping timeout: 260 seconds)
19:22 | | ItsYoda has quit IRC (Ping timeout: 260 seconds)
19:25 | | ItsYoda has joined #archiveteam-bs
19:33 | | Muad-Dib has joined #archiveteam-bs
20:08 | | Stiletto has quit IRC (Ping timeout: 250 seconds)
20:09 | | odemg has quit IRC (Remote host closed the connection)
20:42 | | odemg has joined #archiveteam-bs
20:49 | | bsmith093 has quit IRC (Remote host closed the connection)
20:50 | SchroSct | is there a team to get pewdiepie to negative?
20:50 | | odemg has quit IRC (Remote host closed the connection)
20:52 | | bsmith093 has joined #archiveteam-bs
21:03 | | kristian_ has joined #archiveteam-bs
21:04 | | ndiddy has joined #archiveteam-bs
21:13 | kristian_ | hi all
21:14 | kristian_ | can I do something so that a website is archived in full regularly?
21:14 | xmc | not with our existing tools, but you're welcome to make new tools
21:15 | xmc | how big of a site, what is it, how often?
21:15 | kristian_ | xmc, I can barely code ;)
21:15 | kristian_ | http://starwarsmesse.dk/
21:15 | kristian_ | I'm thinking ... once every 60 days or so
21:16 | rocode | kristian_, I do something similar with several websites, where I will archive them every 30 days. You can use grab-site and a cron job.
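As a rough sketch of what rocode describes (assuming grab-site is installed and on the PATH that cron uses, and using a made-up directory /home/kristian/crawls), a single crontab line is enough to re-crawl a small site on a fixed schedule:

    # crontab entry: run at 03:00 on the 1st of every second month (~every 60 days)
    # grab-site writes its crawl output under the current working directory
    0 3 1 */2 * cd /home/kristian/crawls && grab-site 'http://starwarsmesse.dk/'

grab-site only does the crawling; the resulting WARCs would still need to be uploaded somewhere (archive.org, for example) by hand or by another script.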
21:16 | kristian_ | ah, rocode ... I see
21:18 | Sanqui | kristian_: the website is tiny. you can could by #archivebot every 60 days yourself and ask for it to be archived :p
21:18 | Sanqui | err
21:18 | Sanqui | you could stop by #archivebot*
21:18 | kristian_ | thanks, Sanqui ... will look into that
21:19 | kristian_ | it's quite small, yes ... and the genius webmaster (me) tried to make it future proof ;)
21:30 | | dashcloud has quit IRC (Read error: Operation timed out)
21:35 | | odemg has joined #archiveteam-bs
21:36 | | Stil3tt0 has joined #archiveteam-bs
21:46 | | pizzaiolo has quit IRC (Read error: Connection reset by peer)
21:48 | | pizzaiolo has joined #archiveteam-bs
21:52 | | dashcloud has joined #archiveteam-bs
22:01 | | icedice2 has quit IRC (Quit: Leaving)
22:13 | dashcloud | kristian_: make sure if you are using a robots.txt file it doesn't block the Internet Archive crawler (ia_archiver I believe)
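For reference, explicitly allowing that crawler is a two-line stanza in robots.txt (standard robots.txt syntax; an empty Disallow means nothing is blocked for that user-agent):

    User-agent: ia_archiver
    Disallow:

Having no robots.txt at all, as turns out to be the case for starwarsmesse.dk below, also blocks nothing.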
22:16 | kristian_ | hurm ... the archiving does not show up here: http://web.archive.org/web/*/starwarsmesse.dk
22:17 | kristian_ | I can't see a robots.txt: http://starwarsmesse.dk/robots.txt
22:20 | Frogging | what doesn't show up?
22:21 | Frogging | I see snapshots there
22:21 | Frogging | such ast this one http://web.archive.org/web/20170204114255/http://www.starwarsmesse.dk/
22:21 | VADemon | Wayback's robots.txt parser is insanely broken or outdated - whatever you call it.
22:21 | Frogging | there's no robots issue here however
22:21 | VADemon | just in case. e.g. whitelisting it won't actually "allow" the access
22:23 | kristian_ | Frogging, I requested an archiving about an hour ago
22:26 | Frogging | stuff from archivebot won't instantly show up in wayback
22:26 | Frogging | it takes time. days at least, I think
22:26 | | dashcloud has quit IRC (Read error: Operation timed out)
22:26 | kristian_ | thanks, Frogging ... I'll check in in a few days
22:27 | Frogging | archivebot isn't the IA, it just uploads there ultimately
22:27 | Frogging | :)
22:36 | SchroSct | neat, how deep does it crawl?
22:39 | Frogging | infinitely (on the specified domain) unless you tell it not to
22:47 | | pizzaiolo has quit IRC (Ping timeout: 506 seconds)
22:53 | | BlueMaxim has joined #archiveteam-bs
23:01 | kristian_ | much swifter than the waybackmachine interface, though
23:21 | | BlueMaxim has quit IRC (Quit: Leaving)
23:24 | | Stil3tt0 has quit IRC (Read error: Operation timed out)
23:30 | | GE has quit IRC (Remote host closed the connection)
23:34 | | kristian_ has quit IRC (Quit: Leaving)